I recently wrote my own implementation of ALBEF, but when I evaluated it on some masked sentences, it failed.

I am using this image:

When I asked “This is a chocolate <|mask|>”, it generated “This is a chocolate urn”. Quite strange.

Then I asked “This is a <|mask|> cake”, and it generated “This is a iph cake”. Totally wrong.

After checking my dataset implementation and training on a small subset of CC3M, a week passed before I finally found the reason today: tiktoken is a BPE tokenizer that uses sub-words as tokens, and these sub-words severely hurt the model. Since a single <|mask|> slot corresponds to a single token, the model can only fill it with one BPE token, which is often just a word fragment. Sub-words like “urn” and “iph” appear so many times in the training data that the model learns to predict them for the masked position.
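To make the failure mode concrete, here is a toy sketch of sub-word splitting. The vocabulary and the `bpe_split` helper below are hypothetical, purely for illustration; real tiktoken vocabularies are learned merge rules, but greedy longest-match splitting shows the same effect of words shattering into fragments:

```python
# Hypothetical toy vocabulary; greedy longest-match stands in for real BPE,
# which applies learned merges but produces the same kind of fragments.
TOY_VOCAB = {"this", "is", "a", "chocolate", "urn", "iph", "one",
             "c", "ake", "sat"}

def bpe_split(word: str) -> list[str]:
    """Greedily split `word` into the longest prefixes found in TOY_VOCAB."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:  # no prefix matched; fall back to a single character
            pieces.append(word[i])
            i += 1
    return pieces

print(bpe_split("cake"))    # ['c', 'ake']
print(bpe_split("saturn"))  # ['sat', 'urn']
print(bpe_split("iphone"))  # ['iph', 'one']
```

When MLM masking lands on one of these positions, the prediction target is a fragment like “urn” or “iph”, so the model learns to emit fragments in isolation.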

After replacing tiktoken with BertTokenizerFast (from the “transformers” package), the model correctly generates “This is a chocolate cake”.
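The reason the swap helps: BertTokenizerFast uses a WordPiece vocabulary that keeps frequent words like “cake” as single tokens and reserves a dedicated [MASK] token, so the masked slot maps to one whole word. A minimal sketch of that word-level view, using a hypothetical toy vocabulary (not the real BERT vocab):

```python
# Hypothetical word-level vocabulary with a dedicated [MASK] token,
# mimicking how WordPiece keeps frequent words whole.
WORD_VOCAB = {"[MASK]": 0, "this": 1, "is": 2, "a": 3, "chocolate": 4, "cake": 5}

def encode(sentence: str) -> list[int]:
    """Map each whitespace-separated word to a single vocabulary id."""
    return [WORD_VOCAB[w] for w in sentence.split()]

ids = encode("this is a chocolate [MASK]")
# The masked slot covers exactly one whole word, so the MLM target is a
# full word id (here, the id of "cake"), never a sub-word fragment.
print(ids)  # [1, 2, 3, 4, 0]
```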