MobileNetV2 – Robin on Linux

The uneasy way to implement SSDLite by myself

SSDLite is a variant of the Single Shot MultiBox Detector (SSD). It uses MobileNetV2 instead of VGG as the backbone, which makes detection much faster. I was trying to implement SSDLite on top of the ssd.pytorch code base. Although it was not easy, I learned a lot from the whole process.
First, I simply replaced VGG with MobileNetV2 in the code. However, the loss stopped decreasing after training for a while. Not knowing the reason, I had to compare my code with another open source project, ssds.pytorch, to try to find the cause.
Very soon, I noticed that unlike the VGG backbone, which builds the detection framework starting from a 38×38 feature map, MobileNetV2 uses a 19×19 feature map as its first detection layer.

“For MobileNetV1, we follow the setup in [33]. For MobileNetV2, the first layer of SSDLite is attached to the expansion of layer 15 (with output stride of 16).”
From: MobileNetV2: Inverted Residuals and Linear Bottlenecks
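To see where 38×38 and 19×19 come from, here is a small sketch of the arithmetic. It assumes a 300×300 input (the usual SSD300 setting, which the post does not state explicitly) and models every downsampling step as a 3×3 convolution with stride 2 and padding 1, which is how MobileNetV2 halves its resolution. The helper name feature_map_size is made up for illustration.

```python
def feature_map_size(input_size, num_downsamples):
    """Spatial size after repeated 3x3, stride-2, padding-1 convolutions."""
    size = input_size
    for _ in range(num_downsamples):
        # Standard conv output formula: floor((n + 2*pad - kernel) / stride) + 1
        size = (size + 2 * 1 - 3) // 2 + 1
    return size


# Assumed SSD300-style input resolution.
INPUT_SIZE = 300

stride8_size = feature_map_size(INPUT_SIZE, 3)   # output stride 8  -> 38
stride16_size = feature_map_size(INPUT_SIZE, 4)  # output stride 16 -> 19

print(stride8_size, stride16_size)
```

So attaching the first detection layer at output stride 16, as the paper describes, gives a 19×19 map, while output stride 8 gives the 38×38 map that the VGG version starts from.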

After I changed my code to follow the paper's description, the loss still wouldn't decrease during training.
Over the next three weeks, I tried a lot of things: changing the aspect ratios, replacing SGD with SGDR, changing the number of default boxes, even modifying the network structure to be identical to ssds.pytorch. None of them solved the problem. There was another weird phenomenon: when I ran prediction with my model, it usually gave random detection output.
It wasn't until last week that I noticed my model file was about 10MB while the one from ssds.pytorch was 18MB. Why would they have different sizes if the models were exactly the same? Following this clue, I eventually found the cause: a large part of my model was never being trained or saved at all!
My old code only implemented the forward() pass through the MobileNetV2 layers, which is not enough for them to count as part of the whole model. So I wrapped them in nn.ModuleList() to build the model from a list of layers, as in this patch:

self.backbone = nn.ModuleList(mobilenet.MobileNetV2(num_classes=num_classes, width_mult=1.0).features)

nn.ModuleList() registers every layer in the list as a submodule, so their parameters are handed to the optimizer, updated during training, and saved with the model weights. Layers kept in a plain Python list are not registered: their weights stay at their random initialization and are only used in the forward pass. That's why I was getting random output before.
I think I should be more careful when I add FPN to my model in the future.
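The difference is easy to reproduce on a toy model. The sketch below is not my SSDLite code: the two classes and the tiny nn.Linear layers are made up for illustration. It only shows that layers held in a plain Python list are invisible to parameters() and state_dict(), while the same layers inside nn.ModuleList() are registered.

```python
import torch
from torch import nn


class PlainListNet(nn.Module):
    """Hypothetical model that keeps its layers in a plain Python list."""

    def __init__(self):
        super().__init__()
        # A plain list is not an nn.Module, so nothing gets registered here.
        self.layers = [nn.Linear(4, 8), nn.Linear(8, 2)]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class ModuleListNet(nn.Module):
    """Same layers, but wrapped in nn.ModuleList so they are registered."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 8), nn.Linear(8, 2)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def count_params(model):
    """Number of scalar parameters the optimizer would receive."""
    return sum(p.numel() for p in model.parameters())


plain = PlainListNet()
registered = ModuleListNet()

# Both models run forward without any error, which is what hides the bug.
x = torch.randn(1, 4)
plain_out = plain(x)
registered_out = registered(x)

plain_count = count_params(plain)
registered_count = count_params(registered)
plain_keys = list(plain.state_dict().keys())
registered_keys = list(registered.state_dict().keys())

print("plain list :", plain_count, plain_keys)
print("ModuleList :", registered_count, registered_keys)
```

Comparing the parameter count or the state_dict() keys against a reference implementation would probably have pointed me at the problem much earlier than the file size did.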