CLIP is already a three-year-old paper, but its simple design and strong performance still attract me. After one week of programming and debugging, I finished the v0.1 version of my tiny CLIP. It uses ConvNeXtV2 Nano as the image encoder and parts of nanoGPT for the text encoder, so both encoders come in at about 35 million parameters.
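
To make the setup concrete, here is a minimal sketch of the two-tower model (the 512-dimensional projection, the timm call, and the module names are illustrative assumptions, not necessarily my exact code):

```python
import timm
import torch
import torch.nn as nn

class TinyCLIP(nn.Module):
    """Dual-encoder CLIP-style model: a ConvNeXtV2 Nano image tower and a
    small GPT-style text tower, each projected into a shared embedding space."""

    def __init__(self, text_encoder: nn.Module, text_width: int, embed_dim: int = 512):
        super().__init__()
        # timm returns pooled features when num_classes=0
        self.image_encoder = timm.create_model(
            "convnextv2_nano", pretrained=False, num_classes=0
        )
        self.text_encoder = text_encoder  # nanoGPT-style transformer, pooled output assumed
        self.image_proj = nn.Linear(self.image_encoder.num_features, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_width, embed_dim, bias=False)

    def forward(self, images: torch.Tensor, tokens: torch.Tensor):
        img_emb = self.image_proj(self.image_encoder(images))  # (B, embed_dim)
        txt_emb = self.text_proj(self.text_encoder(tokens))    # (B, embed_dim)
        return img_emb, txt_emb
```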

The training dataset is CC3M, downloaded with the img2dataset tool. The actual number of images I got is about 2.3 million (which might be due to my awful network environment). For testing, I use the 50,000 validation images of ImageNet-1K.
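
The download itself can be driven from Python with something along these lines (the column names and output format depend on how the CC3M TSV is prepared, so treat this as a sketch rather than my exact command):

```python
from img2dataset import download

# Fetch CC3M images from a TSV of (caption, url) pairs.
# The column names assume a header row was added to the original CC3M file.
download(
    url_list="cc3m.tsv",
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",
    output_folder="cc3m_data",
    processes_count=16,
    thread_count=64,
    image_size=256,
)
```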

I split CC3M into 90% for training and 10% for validation. After just one night of training (electricity is much cheaper at night), the result seems too good to be true:

[Eval] loss: 0.2333 accuracy: 0.9257
[003 : 131000] loss: 0.5992 accu: 0.8281 lr: 1.0000e-06 time: 642.28
[004 : 132000] loss: 0.5567 accu: 0.7969 lr: 1.0000e-06 time: 198.91
[004 : 133000] loss: 0.4493 accu: 0.8750 lr: 1.0000e-06 time: 198.52
[004 : 134000] loss: 0.4729 accu: 0.8281 lr: 1.0000e-06 time: 199.15
[004 : 135000] loss: 0.5102 accu: 0.8281 lr: 1.0000e-06 time: 198.22

The accuracy on the 10% validation split is as high as 0.9257, which I guess is caused by the small dataset. Evaluating on the 50,000 validation images of ImageNet-1K gives a top-5 accuracy of only 6.66%, far below even the 11.5% zero-shot accuracy reported in the 2016 paper.
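
For reference, the zero-shot evaluation follows the usual CLIP recipe: encode one prompt per class, then rank the classes by cosine similarity to each image embedding. A condensed sketch, assuming the TinyCLIP module above (the prompt template and tokenizer call are placeholders):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_top5(model, tokenizer, loader, class_names, device="cuda"):
    """Zero-shot top-5 accuracy on ImageNet-1K: one prompt per class,
    images classified by cosine similarity to the prompt embeddings."""
    model.eval()
    prompts = [f"a photo of a {name}" for name in class_names]
    tokens = tokenizer(prompts).to(device)                      # (1000, seq_len), placeholder API
    txt_emb = F.normalize(model.text_proj(model.text_encoder(tokens)), dim=-1)

    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        img_emb = F.normalize(model.image_proj(model.image_encoder(images)), dim=-1)
        top5 = (img_emb @ txt_emb.t()).topk(5, dim=-1).indices  # (B, 5)
        correct += (top5 == labels.unsqueeze(1)).any(dim=-1).sum().item()
        total += labels.numel()
    return correct / total
```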

Therefore, I will use CC12M in the next step.

There are also some issues I still need to solve:

  1. The original CLIP paper applies L2 normalization to the multimodal embeddings, but in my tests the model does not converge when I do the same (a sketch of the standard approach follows this list).
  2. Adding the learnable temperature parameter causes a training error that asks for a retain_graph=True argument to backward().
  3. With torch.compile(), training reports a Triton error after the first epoch.
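
For items 1 and 2, the original CLIP pseudocode L2-normalizes both embeddings and scales the similarity matrix by exp(logit_scale), where logit_scale is a learnable scalar initialized to log(1/0.07) and clipped so the multiplier never exceeds 100. The retain_graph error usually means a tensor built in an earlier iteration (for example a cached, already-exponentiated temperature) is still attached to the current graph. A sketch of that standard loss, assuming the TinyCLIP module above:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipLoss(nn.Module):
    """Symmetric InfoNCE loss over L2-normalized embeddings with a learnable
    temperature, following the pseudocode in the CLIP paper."""

    def __init__(self):
        super().__init__()
        # learnable log-temperature, initialized so that exp(.) == 1 / 0.07
        self.logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor):
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)

        # Recompute the scale inside every forward pass; reusing a tensor
        # created in a previous iteration is what forces retain_graph=True.
        scale = self.logit_scale.clamp(max=math.log(100.0)).exp()
        logits = scale * img_emb @ txt_emb.t()                  # (B, B)

        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
```

The non-convergence with L2 normalization may simply be a temperature problem: unit-norm embeddings give cosine similarities in [-1, 1], so without a large scale factor the softmax stays nearly uniform and the loss barely moves.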

Wish me luck.