I have been using RegNetY in DongNiao for almost two years. Previously I only used small models such as RegNetY-8G, but after getting a computer with an RTX 3080 Ti, I started to use the biggest one in the original paper, RegNetY-32G.
The RegNetY-32G model takes a lot of time to train, so I wanted to use mixed precision in the process. However, after switching to “float16”, the training program always crashed with gradient overflow errors:
...
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0237e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.06e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.265e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.16e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
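For context, this message is printed by NVIDIA Apex's dynamic loss scaler, which halves the loss scale every time a step overflows. My setup is essentially the standard Apex recipe; a minimal sketch (variable names here are illustrative, not my actual code):

from apex import amp

# Wrap the model and optimizer; "O1" patches ops to run in float16 where safe
net, optimizer = amp.initialize(net, optimizer, opt_level="O1")

for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(net(images), labels)
    # Dynamic loss scaling: an inf/NaN gradient prints the message above,
    # skips the step, and halves the loss scale
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()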
At first, I suspected that the bigger model couldn't handle a large learning rate (I had used 8.0 for a long time) with “float16” training, so I reduced the learning rate to just 1e-1. The model stopped reporting the overflow error, but the loss couldn't converge and just stayed constant at about 9 (for cross-entropy, a flat loss near ln(num_classes) means the model is doing no better than random guessing).
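Since the model is built through a pycls-style config (as the snippet below shows), the learning-rate change itself is one line; assuming the standard cfg.OPTIM key:

cfg.OPTIM.BASE_LR = 0.1  # reduced from the 8.0 I had used before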
Then I had no choice but to adjust the parameters step by step to find a set of hyper-parameters that would converge. Finally, I found the reason: enabling the Squeeze-and-Excitation block in RegNetY makes the model harder to converge. The exponential operation in the Sigmoid function might be the cause, since “float16” can't always handle the exponential's range properly.
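A quick way to see the problem: the largest value “float16” can represent is 65504, so the exp() inside a Sigmoid already overflows to inf for inputs around 11.1 and beyond:

import torch

# float16 maxes out at 65504, so exp() overflows to inf once x exceeds ~11.09
x = torch.tensor([11.0, 12.0])
print(torch.exp(x).half())  # tensor([59872., inf], dtype=torch.float16)

For a Sigmoid this bites on large negative activations, where exp(-x) blows up; once an inf or NaN reaches the gradients, the loss scaler keeps skipping steps exactly as in the log above.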
The solution is simple: just disable the Squeeze-and-Excitation block in RegNetY:
cfg.MODEL.TYPE = "regnet"

# RegNetY-32.0GF
cfg.REGNET.DEPTH = 20
cfg.REGNET.SE_ON = False
cfg.REGNET.W0 = 232
cfg.REGNET.WA = 115.89
cfg.REGNET.WM = 2.53
cfg.REGNET.GROUP_W = 232
cfg.BN.NUM_GROUPS = 4
cfg.MODEL.NUM_CLASSES = config["num_classes"]

net = model_builder.build_model()
In a future experiment, I may try replacing the Sigmoid with a Hard Sigmoid in the Squeeze-and-Excitation block instead of disabling it.
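A Hard Sigmoid is piecewise linear, so it avoids exp() entirely and should stay well-behaved in “float16”. Here is a rough sketch of what that swap could look like (the SE module below is illustrative, modeled on pycls's version, not DongNiao's actual code):

import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-Excitation block with a float16-friendly gate."""

    def __init__(self, w_in, w_se):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.f_ex = nn.Sequential(
            nn.Conv2d(w_in, w_se, 1, bias=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(w_se, w_in, 1, bias=True),
            nn.Hardsigmoid(),  # piecewise linear, relu6(x + 3) / 6: no exp()
        )

    def forward(self, x):
        # Channel-wise gating, same as the standard SE block
        return x * self.f_ex(self.avg_pool(x))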