After downloading the whole CC12M dataset from Hugging Face, I wrote a tool to extract all of the image–text-pair files into one directory. But after extracting about 17 million files (17,681,010, to be exact), the tool reported an error:
Exception: [Errno 28] No space left on device: '/home/robin/Downloads/cc12m/011647171.txt'
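For context, the write loop in such a tool might look like this minimal sketch (the `extract_pairs` helper and the in-memory list of pairs are hypothetical; the real tool reads the downloaded CC12M archives):

```python
import errno
import os

def extract_pairs(pairs, out_dir):
    """Write (name, caption) pairs as individual .txt files in one directory.

    Hypothetical sketch of the extraction step; the real tool streams the
    pairs out of the dataset archives instead of taking an in-memory list.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    for name, caption in pairs:
        path = os.path.join(out_dir, f"{name}.txt")
        try:
            with open(path, "w") as f:
                f.write(caption)
        except OSError as e:
            if e.errno == errno.ENOSPC:
                # "No space left on device" -- as it turns out, this does
                # not necessarily mean we are out of blocks or inodes.
                print(f"ENOSPC after {written} files, at {path}")
            raise
        written += 1
    return written
```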
I checked the space and inodes on my ext4 filesystem, and both seemed to have plenty of free capacity:
# "df -lh"
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.2G  2.0M  3.2G   1% /run
/dev/nvme0n1p1  916G  410G  461G  48% /
tmpfs            16G  412K   16G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
efivarfs        192K  124K   64K  66% /sys/firmware/efi/efivars
/dev/sda2        96M   32M   65M  33% /boot/efi
tmpfs           3.2G   56K  3.2G   1% /run/user/1000
/dev/sdb2       3.7T  2.5T  1.3T  67% /mnt
# "df -i"
Filesystem       Inodes    IUsed    IFree IUse% Mounted on
tmpfs           4077303     1264  4076039    1% /run
/dev/nvme0n1p1 61054976  9308680 51746296   16% /
tmpfs           4077303      104  4077199    1% /dev/shm
tmpfs           4077303        4  4077299    1% /run/lock
efivarfs              0        0        0     - /sys/firmware/efi/efivars
/dev/sda2             0        0        0     - /boot/efi
tmpfs            815460       61   815399    1% /run/user/1000
/dev/sdb2             0        0        0     - /mnt
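The same numbers can also be checked programmatically with `os.statvfs`, which exposes the block and inode counters that `df` reads:

```python
import os

def fs_capacity(path):
    """Report free space and free inodes for the filesystem containing
    `path` -- the same counters shown by `df -lh` and `df -i`."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,   # available to non-root
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
    }
```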
So why did the ext4 filesystem return a “No space” error? The reason is explained here: https://blog.merovius.de/posts/2013-10-20-ext4-mysterious-no-space-left-on/. In short, ext4’s “dir_index” feature stores directory entries in a fixed-depth hash tree; once the leaf blocks for one hash range are full, inserting another entry that hashes into that range fails with ENOSPC, even though plenty of blocks and inodes remain.
Running “sudo dumpe2fs /dev/nvme0n1p1” showed:
...
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      e10f88a7-1d8c-4c38-a796-6fa15bdf4e65
...
It seems the hash algorithm used by the “dir_index” feature of my ext4 filesystem is already “half_md4”, so my only remaining choice was “tea”. (The default hash algorithm when you create a filesystem with “mke2fs” is “half_md4”.)
But after I made the change:
sudo tune2fs -E "hash_alg=tea" /dev/nvme0n1p1
the “No space left on device” error still appeared…
There are two solutions left:
- Rewrite my tool to pack the previous “small files” into a few big flat files
- Replace ext4 with XFS (I will test this after I get another NVMe SSD)
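The first solution can be sketched like this: pack the per-sample caption files into a single uncompressed tar archive, so the directory holds one entry instead of millions. (The `pack_captions`/`read_caption` helpers and file names are illustrative, not the actual tool.)

```python
import io
import tarfile

def pack_captions(pairs, archive_path):
    """Append (name, caption) pairs into one big tar archive instead of
    creating one tiny .txt file per sample."""
    with tarfile.open(archive_path, "w") as tar:
        for name, caption in pairs:
            data = caption.encode("utf-8")
            info = tarfile.TarInfo(name=f"{name}.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

def read_caption(archive_path, name):
    """Read back a single caption by member name."""
    with tarfile.open(archive_path, "r") as tar:
        return tar.extractfile(f"{name}.txt").read().decode("utf-8")
```

An uncompressed tar also keeps the samples streamable, which is how webdataset-style training pipelines usually consume CC12M-scale data anyway.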