Apex is a mixed precision training library from NVIDIA. I have been using it since I got an RTX 3080 Ti GPU. A few days ago I started training RegNetY-32GF (previously I had only used RegNetY models smaller than 16GF). After an accidental interruption, I tried to resume the training, but it reported:
Traceback (most recent call last):
  File "train.py", line 353, in <module>
    train(args, train_loader, eval_loader)
  File "train.py", line 220, in train
    scaled_loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 401, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 191, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([28, 3712, 10, 10], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(3712, 3712, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x55d2a620ff60
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 28, 3712, 10, 10,
    strideA = 371200, 100, 10, 1,
output: TensorDescriptor 0x55d2a6215310
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 28, 3712, 10, 10,
    strideA = 371200, 100, 10, 1,
weight: FilterDescriptor 0x7fd9e806f1e0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 3712, 3712, 1, 1,
Pointer addresses:
    input: 0x7fd73fde3a00
    output: 0x7fd746abb600
    weight: 0x7fd761b5de00
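For context, the scaled_loss.backward() in the traceback is the usual apex.amp training pattern. A minimal sketch of that pattern (build_model and train_loader are placeholders, and the opt_level is just the common default, not my exact training script):

import torch
from apex import amp

# build_model and train_loader stand in for my actual model and data pipeline
model = build_model().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# "O1" is the commonly used mixed precision opt_level
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for images, targets in train_loader:
    images, targets = images.cuda(), targets.cuda()
    loss = criterion(model(images), targets)
    optimizer.zero_grad()
    # the backward pass goes through amp's loss-scaling context,
    # which is the scaled_loss.backward() shown in the traceback above
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()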
This error looks quite scary, so the first thought that came to my mind was that the training environment had crashed! I downloaded the newest GPU driver and pulled the most recent PyTorch Docker container, but the error persisted.
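By the way, a quick sanity check of what the container actually sees after such an upgrade (nothing project-specific here, just standard PyTorch calls):

import torch

print(torch.__version__)                     # PyTorch build inside the container
print(torch.version.cuda)                    # CUDA version it was built against
print(torch.backends.cudnn.version())        # cuDNN version actually loaded
print(torch.cuda.get_device_name(0))         # should report the RTX 3080 Ti
props = torch.cuda.get_device_properties(0)
print(round(props.total_memory / 1024**3, 1), "GiB total memory")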
As a second thought, I began to suspect that apex could not handle such a big model (what was I thinking?), so I modified my code to use “torch.cuda.amp” instead of “apex.amp”, following the official documentation. Fortunately, the error disappeared, but I had to use a smaller batch size. It looked as if “torch.cuda.amp” could not save as much GPU memory as “apex.amp”.
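The switch itself is small. Roughly, the apex sketch above becomes the following (again with the same placeholder names, not my exact script):

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for images, targets in train_loader:
    images, targets = images.cuda(), targets.cuda()
    optimizer.zero_grad()
    with autocast():                      # forward pass runs in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()         # backward on the scaled loss
    scaler.step(optimizer)                # unscales gradients, then steps the optimizer
    scaler.update()                       # adjusts the loss scale for the next step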
However, the story does not end here. Just before writing this article, I tried my old “apex.amp” code again with a smaller batch size, and it worked well…
All in all, the terrible error above is simply caused by insufficient GPU memory.
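In hindsight, the quickest way to confirm such a diagnosis is to watch the CUDA memory counters around the failing step. A small sketch (not taken from my training script) of how one might do that:

import torch

def report_gpu_memory(tag):
    # current / reserved / peak memory on the default CUDA device, in GiB
    gib = 1024 ** 3
    print(f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.2f} GiB, "
          f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB, "
          f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB")

# call this right before the backward pass; a peak close to the card's total
# memory suggests the cuDNN "internal error" is an out-of-memory problem in disguise
report_gpu_memory("before backward")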