Debugging the problem of ‘nan’ value in training

Previously, I was using CUB-200 dataset to train my object detection model. But after I used CUB-200-2011 dataset instead, the training loss became ‘nan’.

I tried to reduce the learning rate, change optimizer from SGD to Adam, and use different types of initializer for parameters. None of these solved the problem. Then I realized it would be a hard job to find the cause of the problem. Thus I began to print the value of ‘loss’, then the values of ‘loss_location’ and ‘loss_confidence’. Finally, I noticed that ‘loss_location’ firstly became ‘nan’ because of the value of \hat{g}_j^w in the equation below (from paper) is ‘nan’:

‘loss_location’ from paper ‘SSD: Single Shot MultiBox Detector’

After checked the implementation in the ‘layers/’ of code:

I realized the (matched[:, 2:] – matched[:, :2]) has got a negative value which never happend when using CUB-200 dataset.

Now it’s time to carefully check the data pipeline for CUB-200-2011 dataset. I reviewed the bounding box file line by line and found out that the format of it is not (Xmin, Ymin, Xmax, Ymax), but (Xmin, Ymin, Width, Height)! Let’s show the images for an incorrect bounding box and correct one:

Parse bounding box by format (Xmin, Ymin, Xmax, Ymax) which is wrong

Parse bounding box by format (Xmin, Ymin, Width, Height) which is correct

After changed the parsing method for the bounding boxes of CUB-200-2011 dataset, my training process runs successfully at last.

The lesson I learned from this problem is that dataset should be seriously reviewed before using.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.