apex is a mixed precision training library from NVIDIA. I have been using it since I got an RTX 3080 Ti GPU. A few days ago, I started training RegNetY-32GF (previously I had only used RegNetY models smaller than 16GF). After an accidental interruption, I tried to resume training, but it reported:
Traceback (most recent call last):
File "train.py", line 353, in <module>
train(args, train_loader, eval_loader)
File "train.py", line 220, in train
scaled_loss.backward()
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 401, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 191, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([28, 3712, 10, 10], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(3712, 3712, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [0, 0, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x55d2a620ff60
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 28, 3712, 10, 10,
strideA = 371200, 100, 10, 1,
output: TensorDescriptor 0x55d2a6215310
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 28, 3712, 10, 10,
strideA = 371200, 100, 10, 1,
weight: FilterDescriptor 0x7fd9e806f1e0
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 3712, 3712, 1, 1,
Pointer addresses:
input: 0x7fd73fde3a00
output: 0x7fd746abb600
weight: 0x7fd761b5de00
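For context, the scaled_loss.backward() frame in the traceback comes from the usual apex.amp pattern. Below is a minimal sketch of what my training step roughly looks like; the tiny stand-in model, data, and optimizer are illustrative, not my actual RegNetY code:

import torch
from apex import amp

# Illustrative stand-ins for the real RegNetY model, loss, and data loader.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(64, 10),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# amp.initialize wraps model and optimizer; "O1" patches common ops to run in half precision where safe.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

images = torch.randn(8, 3, 224, 224, device="cuda")
targets = torch.randint(0, 10, (8,), device="cuda")

optimizer.zero_grad()
loss = criterion(model(images), targets)
# apex scales the loss before backward to avoid gradient underflow in fp16;
# this is the scaled_loss.backward() frame that appears in the traceback above.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()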
This error looks quite scary, so the first thought that came to my mind was: the training environment has broken! I downloaded the newest GPU driver and pulled the latest PyTorch Docker container. But the error persisted.
As a second thought, I began to suspect that apex couldn’t handle such a big model… (what was I thinking?), so I modified my code to use torch.cuda.amp instead of apex.amp, following the documentation. Fortunately, the error disappeared, but I had to use a smaller batch size. It looks like torch.cuda.amp doesn’t reduce GPU memory usage as much as apex.amp does.
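For comparison, here is a minimal sketch of the torch.cuda.amp replacement (again with an illustrative stand-in model and data): autocast runs the forward pass in mixed precision, and GradScaler takes over the loss-scaling role of amp.scale_loss.

import torch
from torch.cuda.amp import autocast, GradScaler

# Same illustrative stand-in model as before.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(64, 10),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()  # handles loss scaling, the role apex's scale_loss played

images = torch.randn(8, 3, 224, 224, device="cuda")
targets = torch.randint(0, 10, (8,), device="cuda")

optimizer.zero_grad()
with autocast():                       # run the forward pass in mixed precision
    loss = criterion(model(images), targets)
scaler.scale(loss).backward()          # backward on the scaled loss
scaler.step(optimizer)                 # unscale gradients and step the optimizer
scaler.update()                        # adjust the scale factor for the next step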
However, the story doesn’t end here. Just before writing this article, I tried a smaller batch size with my old apex.amp code, and it works just as well…
All in all, the terrible error above was simply caused by insufficient GPU memory.
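If you ever hit a similarly misleading cuDNN error, it may be worth ruling out memory pressure before anything else. One quick check is to print PyTorch's allocator statistics right before the failing step:

import torch

# Memory the allocator has actually handed out to live tensors...
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
# ...and memory it has reserved from the driver (cached blocks included).
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")
# Total memory on the device, for comparison.
print(f"total:     {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GiB")

In my understanding, when cuDNN itself cannot obtain the memory it needs (for example while benchmarking convolution algorithms under torch.backends.cudnn.benchmark = True), the failure can surface as CUDNN_STATUS_INTERNAL_ERROR rather than the familiar “CUDA out of memory” message, which is presumably why the error looked so unrelated to memory at first.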