I was trying to implement ALBEF by myself for practice. After finishing all the parts (the Vision part and the BERT part, including the Masked Language Model), I trained the model on the COCO-Captions, SBU-Captions, CC3M and CC12M datasets (actually more data than the original paper). But the results were quite weird: an old steam train was recognised as a building, and a few fish were recognised as statues.
To track down these weird mistakes, I reviewed the code many times and finally noticed a sentence in the paper:
[Image: excerpt from the ALBEF paper describing the image augmentation used during pre-training]
Although it’s just an ordinary sentence in the paper, this augmentation improves the ALBEF model significantly. After randomly cropping the 256×256 raw image to 224×224 and also applying RandAugment, I finally got a more stable and usable model. Let’s see some examples:
[Images: three example predictions from the model trained with augmentation]
Previously, the fish had been recognised as “shoes”, and the bedroom as “city”. Both are recognised correctly after augmentation.
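To make the cropping step concrete, here is a minimal sketch of taking a random 224×224 window out of a 256×256 image. It is not my actual training code: it uses plain Python lists instead of tensors, the name `random_crop` is just for illustration, and it leaves RandAugment out entirely.

```python
import random

def random_crop(image, crop_size, rng=random):
    """Return a random crop_size x crop_size window from a 2-D image.

    `image` is a list of rows (a stand-in for a real image tensor).
    """
    height, width = len(image), len(image[0])
    if crop_size > height or crop_size > width:
        raise ValueError("crop_size must not exceed the image size")
    # randint is inclusive on both ends, so every valid offset can be drawn.
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]

# Dummy 256x256 "image" whose pixel value encodes its (row, column) position.
raw = [[y * 256 + x for x in range(256)] for y in range(256)]
crop = random_crop(raw, 224, random.Random(0))
```

In a real PyTorch pipeline the same idea is usually expressed with `torchvision.transforms.RandomCrop(224)` followed by `torchvision.transforms.RandAugment()`, so that the model sees a slightly different view of each image in every epoch.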
But there are still some interesting bad cases:
[Images: two examples of the remaining bad cases]
Adding the prefix “A picture of” helps the ALBEF model recognise these images, probably because there is a lot of text like “A picture of XXX” in the CC3M and CC12M datasets.
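The prefix trick can be sketched roughly like this, assuming recognition is done by scoring the image against one candidate text per label and picking the best one. The names `build_prompts`, `classify` and `score_fn` are hypothetical: `score_fn` stands in for whatever image-text similarity the trained model provides and is not part of any library.

```python
def build_prompts(labels, prefix="A picture of"):
    """Turn bare labels into caption-like texts, e.g. 'A picture of fish'."""
    return [f"{prefix} {label}" for label in labels]

def classify(image, labels, score_fn, prefix="A picture of"):
    """Return the label whose prefixed prompt gets the highest score.

    score_fn(image, text) -> float is supplied by the caller (an assumption
    here: in practice it would wrap the model's image-text matching score).
    """
    prompts = build_prompts(labels, prefix)
    scores = [score_fn(image, prompt) for prompt in prompts]
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best]
```

The point is only that the text side of each candidate looks like the captions the model was trained on, instead of a single bare word.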
Anyhow, I finally implemented and trained a workable ALBEF model by myself, on my own RTX-3080Ti card.