All the code is here.

The baseline from training on the balanced subset of AudioSet is 0.27 mAP. Adding TimeMasking and FrequencyMasking pushes it slightly, to 0.28 mAP.
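Roughly, this is the SpecAugment-style masking from torchaudio (the mask sizes here are illustrative, not my exact settings):

```python
import torch
import torchaudio

# Mask sizes are illustrative; tune them for your own setup.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=96)

fbank = torch.randn(1, 128, 998)          # (batch, mel bins, frames)
augmented = time_mask(freq_mask(fbank))   # zero a random frequency band and a random time span
```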

I tried mixup of raw waveforms as in AST, but it didn't improve mAP at all (the reason is still a mystery to me). However, mixup of the fbank features pushed the metric to 0.293 mAP.
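For reference, fbank mixup is roughly the following (the Beta parameter and function name are illustrative, not my exact training code):

```python
import numpy as np
import torch

def mixup_fbank(fbank, targets, alpha=0.5):
    """Mix a batch of fbank features and their multi-hot AudioSet labels."""
    lam = np.random.beta(alpha, alpha)        # mixing coefficient
    perm = torch.randperm(fbank.size(0))      # random pairing within the batch
    mixed_fbank = lam * fbank + (1 - lam) * fbank[perm]
    mixed_targets = lam * targets + (1 - lam) * targets[perm]
    return mixed_fbank, mixed_targets
```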

Until then, the fbank features had been resized to (384, 384) to fit the deit_distilled model. After restoring the fbank size to (128, 998), the result reached 0.323 mAP.
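The (128, 998) shape is 128 mel bins over a 10 s clip. A sketch of the extraction with torchaudio's Kaldi-compatible fbank (the exact flags are assumptions):

```python
import torchaudio

waveform, sr = torchaudio.load("clip.wav")  # any 10 s, 16 kHz AudioSet clip
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    htk_compat=True,           # flag choice follows AST; an assumption here
    sample_frequency=sr,
    num_mel_bins=128,
    frame_shift=10.0,          # 10 ms shift -> ~998 frames for 10 s of audio
)
print(fbank.shape)             # torch.Size([998, 128]); transpose for (128, 998)
```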

The most recent (hopefully not the last) change is copied wholesale from AST: keep the pretrained Conv2D parameters from deit_distilled but change the stride, and expand the position embeddings since the sequence length has changed. The result is 0.333 mAP.
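Concretely, the patch-embedding part looks roughly like this with timm (the checkpoint name, the (10, 10) stride, and the channel-summing trick are assumptions modeled on AST):

```python
import timm
import torch
import torch.nn as nn

# Load the pretrained DeiT checkpoint (model name is an assumption).
deit = timm.create_model('deit_base_distilled_patch16_384', pretrained=True)

# Keep the pretrained 16x16 patch kernel, but slide it with a smaller
# stride so patches overlap on the single-channel (128, 998) fbank input.
proj = nn.Conv2d(1, deit.embed_dim, kernel_size=(16, 16), stride=(10, 10))
with torch.no_grad():
    # Collapse the RGB input channels of the pretrained kernel to one.
    proj.weight.copy_(deit.patch_embed.proj.weight.sum(dim=1, keepdim=True))
    proj.bias.copy_(deit.patch_embed.proj.bias)
```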

It is worth noting that this is the first time I have felt the power of a pretrained model first-hand. If I re-initialize the position embeddings instead of bilinearly interpolating them, the result falls far short of 0.333 mAP. Likewise, if I use freshly initialized Conv2D parameters (the first layer of the Vision Transformer), the result is as bad as before.
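The bilinear interpolation of the position embeddings, sketched under the same assumed kernel and stride as above:

```python
import timm
import torch
import torch.nn.functional as F

deit = timm.create_model('deit_base_distilled_patch16_384', pretrained=True)

# pos_embed has shape (1, 2 + 24*24, embed_dim): cls + distillation tokens,
# then a 24x24 grid of patch position embeddings for the 384x384 input.
special_toks, grid = deit.pos_embed[:, :2], deit.pos_embed[:, 2:]

# New patch grid implied by a (16, 16) kernel with (10, 10) stride on (128, 998).
f_dim = (128 - 16) // 10 + 1   # 12 patches along frequency
t_dim = (998 - 16) // 10 + 1   # 99 patches along time

grid = grid.reshape(1, 24, 24, -1).permute(0, 3, 1, 2)
grid = F.interpolate(grid, size=(f_dim, t_dim), mode='bilinear', align_corners=False)
grid = grid.permute(0, 2, 3, 1).reshape(1, f_dim * t_dim, -1)
new_pos_embed = torch.nn.Parameter(torch.cat([special_toks, grid], dim=1))
```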

Next, I will check whether the pretrained model also works well on the unbalanced subset of AudioSet.