I have been using RegNetY in DongNiao for almost two years. Previously I only used smaller models such as RegNetY-8G, but after getting a machine with an RTX 3080 Ti, I started to use the biggest one in the original paper: RegNetY-32G.

The RegNetY-32G model takes a long time to train, so I used mixed-precision training to speed it up. However, after switching to “float16”, the training always failed with gradient overflow:

    ...
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0237e-320
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.012e-320
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.06e-321
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.53e-321
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.265e-321
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.3e-322
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.16e-322
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8e-323
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4e-323
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2e-323
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1e-323
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5e-324
    Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0
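These “Gradient overflow” messages come from NVIDIA apex’s dynamic loss scaler, which halves the loss scale after every overflowing step until it reaches zero. For context, here is a minimal sketch of the apex setup I mean by “mixed-precision” (the tiny model, data, and learning rate are placeholders, not my actual training code):

    import torch
    import torch.nn as nn
    from apex import amp

    # Placeholder model/optimizer, standing in for the real RegNetY training.
    model = nn.Linear(8, 2).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # "O1" patches common ops to run in float16 with dynamic loss scaling.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    loss = model(torch.randn(4, 8).cuda()).sum()

    # The loss is multiplied by the current scale so small float16 gradients
    # don't underflow; on overflow the scaler halves the scale, as in the log.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()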

At first, I suspected that the bigger model couldn’t handle a large learning rate (I had used 8.0 for a long time) with “float16” training, so I reduced the learning rate to just 1e-1. The model stopped reporting the overflow error, but the loss wouldn’t converge and just stayed constant at about 9, which suggested the model wasn’t learning anything at all.
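A loss stuck near 9 is roughly what pure random guessing produces: cross-entropy over N classes starts at about ln(N), and for a class count on the order of 10,000 (an illustrative number here, not the exact DongNiao count) that works out to about 9.2:

    import math

    # Cross-entropy of uniform random guessing over N classes is ln(N).
    # 10,000 is an assumed, illustrative class count.
    print(math.log(10_000))  # ~9.21, matching the observed plateau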

I then had no choice but to adjust the parameters step by step, looking for a set of hyper-parameters that would converge. Finally, I found the reason: enabling the Squeeze-and-Excitation block in RegNetY makes the model much harder to converge. The exponential inside the Sigmoid function is a likely cause, since the narrow range of “float16” can’t always hold the result of the exponential.
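A quick NumPy sketch (illustrative, not the training code) shows the problem: float16 tops out at 65504, so the exponential overflows to inf for inputs barely above 11, and inside Sigmoid, 1/(1 + exp(-x)), that inf collapses the gate to an exact 0:

    import numpy as np

    # float16's maximum value is 65504, so exp(x) overflows to inf
    # once x exceeds ln(65504) ~= 11.09.
    print(np.exp(np.float16(11.0)))  # ~5.99e4, still representable
    print(np.exp(np.float16(12.0)))  # inf (overflow warning in float16)
    print(np.exp(np.float32(12.0)))  # ~1.63e5, fine in float32

    # In sigmoid(x) = 1 / (1 + exp(-x)), that inf turns the gate into 0:
    x = np.float16(-12.0)
    print(np.float16(1) / (np.float16(1) + np.exp(-x)))  # 0.0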

The solution is simple: just disable the Squeeze-and-Excitation block in RegNetY:

    # Imports assuming pycls's layout at the time:
    from pycls.core.config import cfg
    from pycls.core import model_builder

    cfg.MODEL.TYPE = "regnet"
    # RegNetY-32.0GF settings from the paper, with SE disabled
    cfg.REGNET.DEPTH = 20
    cfg.REGNET.SE_ON = False  # the fix: turn off Squeeze-and-Excitation
    cfg.REGNET.W0 = 232
    cfg.REGNET.WA = 115.89
    cfg.REGNET.WM = 2.53
    cfg.REGNET.GROUP_W = 232
    cfg.BN.NUM_GROUPS = 4
    cfg.MODEL.NUM_CLASSES = config["num_classes"]  # "config" is my own dict
    net = model_builder.build_model()

I may try a Hard Sigmoid in the Squeeze-and-Excitation block in a future experiment, as sketched below.
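As a sketch of that experiment (untested, and only mirroring the shape of pycls’s SE module rather than copying it), the gate could be swapped for PyTorch’s piecewise-linear Hardsigmoid, which avoids the exponential entirely:

    import torch
    import torch.nn as nn

    class HardSE(nn.Module):
        """Squeeze-and-Excitation gated by Hardsigmoid instead of Sigmoid.

        Hardsigmoid is relu6(x + 3) / 6: piecewise linear, so there is
        no exponential to overflow under float16.
        """

        def __init__(self, w_in: int, w_se: int):
            super().__init__()
            self.avg_pool = nn.AdaptiveAvgPool2d(1)
            self.f_ex = nn.Sequential(
                nn.Conv2d(w_in, w_se, kernel_size=1, bias=True),
                nn.ReLU(inplace=True),
                nn.Conv2d(w_se, w_in, kernel_size=1, bias=True),
                nn.Hardsigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * self.f_ex(self.avg_pool(x))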