After downloading the whole CC12M dataset from Hugging Face, I wrote a tool to extract all of the image-text-pair files into a single directory. But after extracting about 17 million files (17,681,010 to be exact), the tool reported this error:

Exception: [Errno 28] No space left on device: '/home/robin/Downloads/cc12m/011647171.txt'
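I haven't reproduced the real tool here, but the extraction loop was essentially this kind of walk over downloaded archives. This is a minimal sketch: `extract_pairs`, the tar-shard layout, and the file names are illustrative, not the actual code.

```python
import errno
import os
import tarfile
import tempfile

def extract_pairs(archives, out_dir):
    """Extract every regular file from each tar archive into one flat
    directory, surfacing ENOSPC (errno 28) explicitly."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for archive in archives:
        with tarfile.open(archive) as tf:
            for member in tf.getmembers():
                if not member.isfile():
                    continue
                target = os.path.join(out_dir, os.path.basename(member.name))
                try:
                    with tf.extractfile(member) as src, open(target, "wb") as dst:
                        dst.write(src.read())
                except OSError as exc:
                    if exc.errno == errno.ENOSPC:
                        # This is the failure mode I hit at ~17.7M files.
                        raise Exception(
                            f"[Errno 28] No space left on device: {target!r}"
                        ) from exc
                    raise
                count += 1
    return count

# Tiny self-contained demo: build one sample shard, then extract it.
tmp = tempfile.mkdtemp()
shard = os.path.join(tmp, "shard-000000.tar")
with tarfile.open(shard, "w") as tf:
    for name, data in [("000000001.txt", b"a red car"), ("000000001.jpg", b"\xff\xd8")]:
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        tf.add(path, arcname=name)
extracted = extract_pairs([shard], os.path.join(tmp, "flat"))
```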

I checked the space and inodes of my ext4 filesystem, and it seemed both had plenty of free capacity:

# "df -lh"
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.2G  2.0M  3.2G   1% /run
/dev/nvme0n1p1  916G  410G  461G  48% /
tmpfs            16G  412K   16G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
efivarfs        192K  124K   64K  66% /sys/firmware/efi/efivars
/dev/sda2        96M   32M   65M  33% /boot/efi
tmpfs           3.2G   56K  3.2G   1% /run/user/1000
/dev/sdb2       3.7T  2.5T  1.3T  67% /mnt
# "df -i"
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
tmpfs           4077303    1264  4076039    1% /run
/dev/nvme0n1p1 61054976 9308680 51746296   16% /
tmpfs           4077303     104  4077199    1% /dev/shm
tmpfs           4077303       4  4077299    1% /run/lock
efivarfs              0       0        0     - /sys/firmware/efi/efivars
/dev/sda2             0       0        0     - /boot/efi
tmpfs            815460      61   815399    1% /run/user/1000
/dev/sdb2             0       0        0     - /mnt

So why did the ext4 filesystem return a “No space” error? The reason is explained here: https://blog.merovius.de/posts/2013-10-20-ext4-mysterious-no-space-left-on/. In short: with the “dir_index” feature enabled, ext4 stores directory entries in a hash tree of limited depth; once too many filenames hash into the same bucket, the tree cannot split any further and the kernel returns ENOSPC, even though blocks and inodes are still free.

After running “sudo dumpe2fs /dev/nvme0n1p1”, I got:

...
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      e10f88a7-1d8c-4c38-a796-6fa15bdf4e65
...

It seems the hash algorithm used by the “dir_index” feature of my ext4 filesystem is already “half_md4”, so my only remaining choice is “tea”. (The default “hash_alg” when you create a filesystem with “mke2fs” is “half_md4”.)

But after I make the change:

sudo tune2fs -E "hash_alg=tea" /dev/nvme0n1p1

the “No space left on device” error still appeared…

That leaves two solutions:

  1. Rewrite my tool to pack the previous “small files” into a handful of big flat files
  2. Replace ext4 with XFS (I will test this after I get another NVMe SSD)
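Solution 1 can be sketched like this, assuming plain tar files as the “big file” format; `pack_into_shards`, the shard naming, and the shard size are my own illustrative choices, not part of any existing tool.

```python
import os
import tarfile
import tempfile

def pack_into_shards(src_dir, shard_dir, files_per_shard=100_000):
    """Pack many small files into a few large tar shards, so no single
    directory (and no single ext4 directory hash tree) ever has to hold
    millions of entries."""
    os.makedirs(shard_dir, exist_ok=True)
    names = sorted(os.listdir(src_dir))
    shards = []
    for start in range(0, len(names), files_per_shard):
        shard_path = os.path.join(
            shard_dir, f"shard-{start // files_per_shard:06d}.tar"
        )
        with tarfile.open(shard_path, "w") as tf:
            for name in names[start:start + files_per_shard]:
                tf.add(os.path.join(src_dir, name), arcname=name)
        shards.append(shard_path)
    return shards

# Demo: five small files packed two per shard -> three shards.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "small")
os.makedirs(src)
for i in range(5):
    with open(os.path.join(src, f"{i:09d}.txt"), "w") as f:
        f.write(f"caption {i}")
shards = pack_into_shards(src, os.path.join(tmp, "shards"), files_per_shard=2)
```

Readers then iterate over each shard with tarfile (or a format like WebDataset does the same thing with tar shards for exactly this kind of image-text dataset), instead of touching millions of individual inodes.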