PyTorch

Hanging of PyTorch’s data loader

Long story short: I am trying to build a Siamese network for audio classification. With 50% probability, “dataset.py” picks a pair of audio clips from the same category but from different files (and from different categories for the other 50%). But when evaluation starts, it hangs after fetching a few batches. Interrupting the process shows this trace:

Traceback (most recent call last):                                                                                                                                                                                                        
  File "/home/robin/song/birdclef/old_train.py", line 395, in <module>                                                
    train(args, train_loader, eval_loader)                                                                                                                                                                                                  
  File "/home/robin/song/birdclef/old_train.py", line 280, in train                                                   
    accuracy = evaluate(args, net, eval_loader)                                                                                                                                                                                             
  File "/home/robin/song/birdclef/old_train.py", line 91, in evaluate                                                 
    sounds1, sounds2, type_ids = next(batch_iterator)                                                                 
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()                                                                                                                                                                                                                
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()                                                                                      
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data                                                                                                              
    success, data = self._try_get_data()                                                                                                                                                                                                    
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)                                                                      
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/queue.py", line 180, in get                                   
    self.not_empty.wait(remaining)                                                                                    
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/threading.py", line 324, in wait                              
    gotit = waiter.acquire(True, timeout)                                                                                                                                                                                                   
KeyboardInterrupt

As usual, I started by suspecting PyTorch. Was the version (2.0) so new that it still had bugs? I quickly rejected that thought: if PyTorch were at fault, why didn’t the same hang occur when I was not using the Siamese network?

Then I found this issue on the PyTorch GitHub page. It pointed me to the culprit: the new code in “dataset.py”. Then I noticed the problem in my code:

            arr = self.cat_map[ebird_code]
            pair_wav_name = np.random.choice(arr)
            while pair_wav_name == wav_name:
                pair_wav_name = np.random.choice(arr)
            pair_sound = self.get_sound(pair_wav_name, ebird_code)

If a category has only one file, this loop will run forever. This is the cause of the hang.

The solution is simple:

            arr = self.cat_map[ebird_code]
            if len(arr) > 1:
                pair_wav_name = np.random.choice(arr)
                while pair_wav_name == wav_name:
                    pair_wav_name = np.random.choice(arr)
            else:
                pair_wav_name = wav_name
            pair_sound = self.get_sound(pair_wav_name, ebird_code)
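A variation that avoids the retry loop altogether is to remove the current file from the candidates before sampling. The following is only a minimal sketch: `pick_pair_name` and the toy `cat_map` are hypothetical names for illustration, not part of the original “dataset.py”, and it uses the standard library's `random` instead of `np.random.choice`.

```python
import random


def pick_pair_name(cat_map, category, wav_name, rng=random):
    """Return a file from `category` other than `wav_name`.

    Falls back to `wav_name` itself when the category has no other file,
    so the function can never loop forever.
    """
    candidates = [name for name in cat_map[category] if name != wav_name]
    if not candidates:
        return wav_name
    return rng.choice(candidates)


# Toy category map (hypothetical file names).
cat_map = {"amerob": ["a1.wav", "a2.wav", "a3.wav"], "bkcchi": ["b1.wav"]}
pair_multi = pick_pair_name(cat_map, "amerob", "a1.wav")
pair_single = pick_pair_name(cat_map, "bkcchi", "b1.wav")
```

Because the sampling is done once on a filtered list, the worst case is bounded even when a category contains duplicates of a single file name.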

Fix an Out Of Memory case

Here is my code, which caused OOM (“Out Of Memory”) when running:

for img_batch in data_load:
    input = torch.from_numpy(np.asarray(img_batch)).cuda()
    result = self._net(input.permute(0, 3, 1, 2).float())
    values, indices = torch.topk(result, 10)
    for index in range(len(values)):
        top10 = values[0]
        statistics["accumulate"] += top10[0]

At first CUDA reported OOM, so I naively removed the “cuda()” call to let inference run only on the CPU.

But soon the CPU run also reported OOM. This time I realised that “top10” holds tensors, not plain numbers. Since the loop is not wrapped in torch.no_grad(), every tensor added to the statistics dictionary keeps the computation graph of its batch alive, so memory grows with each iteration. Therefore I should use “top10[0].item()” to convert it to a plain Python number before adding it to the statistics dictionary.

The correct code should be:

...
    for index in range(len(values)):
        top10 = values[0]
        statistics["accumulate"] += top10[0].item()

Take care of the data type when using PyTorch.
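The difference can be seen on a toy model. This is a sketch under assumptions: the small `nn.Linear` stands in for the real network, and the dictionary keys are made up for the illustration. The tensor accumulator carries a `grad_fn` (so it references the graphs of all batches), while `.item()` yields a plain float. Wrapping inference in `torch.no_grad()` avoids building the graph in the first place.

```python
import torch
import torch.nn as nn

net = nn.Linear(8, 20)  # toy stand-in for the real model
statistics = {"as_tensor": 0.0, "as_float": 0.0}

for _ in range(3):
    batch = torch.randn(4, 8)
    result = net(batch)  # gradients are tracked here
    values, indices = torch.topk(result, 10)
    top1 = values[0][0]
    # Accumulating the tensor keeps each batch's graph reachable.
    statistics["as_tensor"] = statistics["as_tensor"] + top1
    # Accumulating a Python float lets the graph be freed right away.
    statistics["as_float"] += top1.item()

# Alternative: do not build a graph at all during inference.
with torch.no_grad():
    result = net(torch.randn(4, 8))
    values_no_grad, _ = torch.topk(result, 10)
```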

A strange problem in RegNetY-32G

I have been using RegNetY in DongNiao for almost two years. Previously I only used small models such as RegNetY-8G. But after getting a computer with an RTX 3080 Ti, I started to use the biggest one in the original paper: RegNetY-32G.

The RegNetY-32G model takes a long time to train, so I wanted to use mixed precision in the process. However, after switching to “float16”, the training program always failed with gradient overflow:

...
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0237e-320                      
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.012e-320                       
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.06e-321                                                                                                                                 
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.53e-321                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.265e-321                       
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.3e-322                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.16e-322                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6e-322                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5e-324                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0

At first, I suspected that the bigger model couldn’t tolerate a large learning rate (I had used 8.0 for a long time) with “float16” training. So I reduced the learning rate to just 1e-1. The model stopped reporting overflow errors, but the loss wouldn’t converge and just stayed at about 9.

Then I had no choice but to adjust the parameters step by step to find a set of hyper-parameters that converged. Finally, I found the reason: enabling the Squeeze-and-Excitation block in RegNetY makes the model harder to converge. The exponential operation in the Sigmoid function might be the cause, since “float16” can’t always represent exponential growth properly.
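To give a feeling for how narrow the “float16” range is for exponentials, here is a small standard-library sketch. It only illustrates the numeric limit (the largest finite half-precision value is 65504); whether the Sigmoid kernel in a given PyTorch build actually overflows this way is an assumption that I have not verified. The helper name `fits_in_float16` is made up for the example.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_float16(value):
    """Return True if `value` can be packed as a finite half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


# exp(x) leaves the float16 range once x exceeds roughly ln(65504).
largest_safe_exponent = math.log(FLOAT16_MAX)
fits_11 = fits_in_float16(math.exp(11.0))
fits_12 = fits_in_float16(math.exp(12.0))
```

So a naive exp() of an input only slightly beyond 11 already cannot be stored in half precision.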

The solution is simple: just disable the Squeeze-and-Excitation block in RegNetY:

    cfg.MODEL.TYPE = "regnet"
    # RegNetY-32.0GF
    cfg.REGNET.DEPTH = 20
    cfg.REGNET.SE_ON = False
    cfg.REGNET.W0 = 232
    cfg.REGNET.WA = 115.89
    cfg.REGNET.WM = 2.53
    cfg.REGNET.GROUP_W = 232
    cfg.BN.NUM_GROUPS = 4
    cfg.MODEL.NUM_CLASSES = config["num_classes"]
    net = model_builder.build_model()

I may need to try Hard Sigmoid in the Squeeze-and-Excitation block in a future experiment.
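For reference, Hard Sigmoid replaces the exponential with a clipped straight line. The sketch below uses the same definition as PyTorch's `torch.nn.Hardsigmoid` (0 for x <= -3, 1 for x >= 3, x / 6 + 1 / 2 in between), written in plain Python for clarity. How to plug it into the Squeeze-and-Excitation block of pycls is not shown here and would be a separate change.

```python
import math


def sigmoid(x):
    """Standard logistic sigmoid; needs an exponential."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    """Piecewise-linear approximation: clip(x / 6 + 0.5, 0, 1)."""
    return min(1.0, max(0.0, x / 6.0 + 0.5))


# Compare both functions on a few sample inputs.
comparison = [(x, sigmoid(x), hard_sigmoid(x)) for x in (-10.0, -1.0, 0.0, 1.0, 10.0)]
```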

Strange error from Nvidia’s apex library

apex is a mixed-precision training library from Nvidia. I have been using it since I got an RTX 3080 Ti GPU. A few days ago, I started to use RegNetY-32GF (previously I had only used RegNetY models smaller than 16GF). After an accidental interruption, I tried to resume the training, but it reported:

Traceback (most recent call last):
  File "train.py", line 353, in <module>
    train(args, train_loader, eval_loader)
  File "train.py", line 220, in train
    scaled_loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 401, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 191, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([28, 3712, 10, 10], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(3712, 3712, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x55d2a620ff60
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 28, 3712, 10, 10, 
    strideA = 371200, 100, 10, 1, 
output: TensorDescriptor 0x55d2a6215310
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 28, 3712, 10, 10, 
    strideA = 371200, 100, 10, 1, 
weight: FilterDescriptor 0x7fd9e806f1e0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 3712, 3712, 1, 1, 
Pointer addresses: 
    input: 0x7fd73fde3a00
    output: 0x7fd746abb600
    weight: 0x7fd761b5de00

This error looks quite scary, so my first thought was that the training environment had broken! So I downloaded the newest GPU driver and pulled the latest Docker container for PyTorch. But the error persisted.

As a second thought, I began to suspect that apex couldn’t handle very big models… (what was I thinking?), so I modified my code to use “torch.cuda.amp” instead of “apex.amp”, following the documentation. Fortunately, the error disappeared, but I had to use a smaller batch size. It looks like “torch.cuda.amp” doesn’t save as much GPU memory as “apex.amp”.

However, the story doesn’t end here. Just before writing this article, I tried a smaller batch size for my old code with “torch.cuda.amp”, and it worked well…

All in all, the terrible error above is simply caused by insufficient GPU memory.
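For readers who want to make the same switch, a training step with “torch.cuda.amp” typically looks like the sketch below. The toy model, data and hyper-parameters are placeholders, not my real training script, and the code falls back to full precision when no GPU is available. Newer PyTorch releases may print a deprecation warning and suggest the `torch.amp` namespace instead.

```python
import torch
import torch.nn as nn

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

net = nn.Linear(16, 4).to(device)  # toy stand-in for the real model
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(8, 16, device=device)
targets = torch.randint(0, 4, (8,), device=device)

losses = []
for _ in range(3):
    optimizer.zero_grad()
    # Forward pass runs in mixed precision when AMP is enabled.
    with torch.cuda.amp.autocast(enabled=use_amp):
        outputs = net(inputs)
        loss = criterion(outputs, targets)
    # Scale the loss, backpropagate, then step and update the scale factor.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    losses.append(loss.item())
```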

Accelerate inference speed of DNN on Intel CPU

To cut the cost of the inference server, I ran some experiments on how to speed up prediction for our model.

import time

import cv2
import torch
import torch.nn as nn

import pycls.core.builders as model_builder
from pycls.core.config import cfg

def pressure_predict(net, tensor_img):
    t0 = time.time()
    for _ in range(10):
        result = net(tensor_img)
        result = softmax(result)
        values, indices = torch.topk(result, 10)
    t1 = time.time()
    print("time:", t1 - t0)
    print(values)

if __name__ == "__main__":
    cfg.MODEL.TYPE = "regnet"
    # RegNetY-8.0GF
    cfg.REGNET.DEPTH = 17
    cfg.REGNET.SE_ON = False
    cfg.REGNET.W0 = 192
    cfg.REGNET.WA = 76.82
    cfg.REGNET.WM = 2.19
    cfg.REGNET.GROUP_W = 56
    cfg.BN.NUM_GROUPS = 4
    cfg.MODEL.NUM_CLASSES = 11120
    net = model_builder.build_model()
    net.load_state_dict(torch.load("bird_cls_2754696.pth", map_location="cpu"))
    net.eval()
    net = net.float()
    softmax = nn.Softmax(dim=1).eval()

    # read image
    img = cv2.imread("blujay.jpg")
    img = cv2.resize(img, (300, 300))
    tensor_img = torch.from_numpy(img).unsqueeze(0).permute(0, 3, 1, 2).float()
    pressure_predict(net, tensor_img)

    dummy_input = torch.randn(1, 3, 300, 300)
    with torch.jit.optimized_execution(True):
        traced_script_module = torch.jit.trace(net, dummy_input)

    net = torch.jit.optimize_for_inference(traced_script_module)
    pressure_predict(net, tensor_img)

    import intel_extension_for_pytorch as ipex
    net = net.to(memory_format=torch.channels_last)
    net = ipex.optimize(net)
    tensor_img = tensor_img.to(memory_format=torch.channels_last)

    with torch.no_grad():
        pressure_predict(net, tensor_img)

Here is the output on my Intel i5-12400 CPU:

Method	Inference time (seconds per 10 runs)
Model used directly	1.6
After PyTorch’s torch.jit.optimize_for_inference()	1.4
After Intel’s ipex.optimize()	0.8

It looks like Intel has worked hard to optimize neural network models for their CPUs. The only problem is that the intel_extension_for_pytorch package is hard to install (I hit a lot of broken dependencies when trying to install and run it); the best way to use it is through the Docker image intel/intel-optimized-pytorch:latest

Average weights of two PyTorch models

After reading this paper, I began an experiment based on it. Referencing this snippet, I wrote my code:

    net1 = model_builder.build_model()
    net2 = model_builder.build_model()
    output = model_builder.build_model()
    net1.load_state_dict(torch.load(args.model1, map_location="cpu"))
    net2.load_state_dict(torch.load(args.model2, map_location="cpu"))
    
    # Average
    sd1 = net1.named_parameters()
    sd2 = net2.named_parameters()
    sdo = dict(sd2)
    for name, param in sd1:
        sdo[name].data.copy_(0.5*param.data + 0.5*sdo[name].data)

    output.load_state_dict(sdo)
    torch.save(output, args.output)
    
    # here is a test
    output.load_state_dict(torch.load(args.output))

But after generating the new model with averaged weights, PyTorch failed to load it:

Traceback (most recent call last):
  File "average_models.py", line 43, in <module>
    output.load_state_dict(torch.load(args.output))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1534, in load_state_dict
    state_dict = state_dict.copy()
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'RegNet' object has no attribute 'copy'

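The reason for the failure is quite simple: torch.save(output, ...) stored the whole model object, but load_state_dict() expects a state_dict. We only need to save the state_dict of the model instead of all its information (since I am using FP16 format). Therefore the correct code should be: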

    net1 = model_builder.build_model()
    net2 = model_builder.build_model()
    net1.load_state_dict(torch.load(args.model1, map_location="cpu"))
    net2.load_state_dict(torch.load(args.model2, map_location="cpu"))

    # Average 
    sd1 = net1.named_parameters()
    sd2 = net2.named_parameters()
    sdo = dict(sd2) 
    for name, param in sd1:
        sdo[name].data.copy_(0.5*param.data + 0.5*sdo[name].data)

    torch.save(sdo, args.output)

BTW, in my experiment, averaging my models didn’t raise the accuracy as the paper suggests.
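One caveat about the code above: named_parameters() does not include buffers such as the running statistics of BatchNorm layers, so the saved dictionary is not a complete state_dict. A possible variation is to average the state_dict() of both models directly. This is only a sketch on a toy model; `average_state_dicts` and `make_toy_model` are hypothetical helpers, and copying integer buffers from the first model is my own assumption about what is reasonable.

```python
import io

import torch
import torch.nn as nn


def average_state_dicts(sd1, sd2):
    """Average floating-point tensors; copy integer buffers from sd1."""
    averaged = {}
    for name, tensor1 in sd1.items():
        tensor2 = sd2[name]
        if tensor1.is_floating_point():
            averaged[name] = 0.5 * tensor1 + 0.5 * tensor2
        else:
            # e.g. BatchNorm's num_batches_tracked counter
            averaged[name] = tensor1.clone()
    return averaged


def make_toy_model():
    return nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))


net1, net2, output = make_toy_model(), make_toy_model(), make_toy_model()
averaged = average_state_dicts(net1.state_dict(), net2.state_dict())

# Save only the state_dict and load it back, as a round-trip check.
buffer = io.BytesIO()
torch.save(averaged, buffer)
buffer.seek(0)
output.load_state_dict(torch.load(buffer))
```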

Insert multiple lines in a specific position of a file

I have used awk for quite a long time, but not its sibling sed. A couple of days ago I wanted to insert two lines at a specific position in a CMake file and found a perfect answer: here.

Now I could add two lines by using:

sed -i '/^enable_language/i set(CMAKE_CUDA_ARCHITECTURES 86)\nset(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)' \
	cmake/public/cuda.cmake

The CMake file changed from:

if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang")
   set(CMAKE_CUDA_HOST_COMPILER "${CMAKE_C_COMPILER}")
endif()
enable_language(CUDA)
set(CMAKE_CUDA_STANDARD ${CMAKE_CXX_STANDARD})
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

to:

if("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang")
   set(CMAKE_CUDA_HOST_COMPILER "${CMAKE_C_COMPILER}")
endif()
set(CMAKE_CUDA_ARCHITECTURES 86)
set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
enable_language(CUDA)
set(CMAKE_CUDA_STANDARD ${CMAKE_CXX_STANDARD})
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

Model saving error when using Apex

Apex is a tool from Nvidia that enables mixed-precision training.

import apex.amp as amp

net, optimizer = amp.initialize(net, optimizer, opt_level="O2")

# forward
outputs = net(inputs)

loss = criterion(outputs, targets)

optimizer.zero_grad()

# float16 backward
with amp.scale_loss(loss, optimizer) as scaled_loss:
  scaled_loss.backward()
  
optimizer.step()

...

torch.save(net, "model.pth")

After I changed my code to use Apex, saving the model with torch.save(net, "model.pth") reported an error:

AttributeError: Can't pickle local object '_initialize.<locals>.patch_forward.<locals>.new_fwd'

Someone has already noticed this problem, but it seems no one wants to solve it: link. The only solution I found comes from a Chinese blog: link. It recommends saving just the model parameters:

torch.save(net.state_dict(), "model.pth")

Debug CUDA error for PyTorch

After I changed the dataset used by my code, the training failed:

/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [59,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [60,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [61,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "train.py", line 337, in <module>
    train(args, train_loader, eval_loader)
  File "train.py", line 189, in train
    sounds = aug(sounds)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sanbai/birds_sound_classification/utils/augment.py", line 13, in forward
    image = (image - image.mean()) / image.std()
RuntimeError: CUDA error: device-side assert triggered

It’s terribly hard to find the cause of the common error “RuntimeError: CUDA error: device-side assert triggered”, because CUDA kernels run asynchronously and the Python traceback often points at an unrelated line. But someone on GitHub recommends a method: setting CUDA_LAUNCH_BLOCKING=1 before the program.

Now the real error behind the RuntimeError showed up: I had set the wrong number of categories for the model.
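Both steps can be sketched in a few lines. The environment variable has to be set before CUDA is initialised, which is equivalent to launching with `CUDA_LAUNCH_BLOCKING=1 python train.py`. The label check is a hypothetical helper (`find_bad_labels` is not part of my training code) that would have caught the wrong category count on the CPU, before any kernel ran.

```python
import os

# Must be set before the first CUDA call, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_bad_labels(labels, num_classes):
    """Return the sorted labels that fall outside [0, num_classes)."""
    return sorted({label for label in labels if not 0 <= label < num_classes})


# Toy example: a model configured for 10 classes, data with 13 categories.
bad = find_bad_labels([0, 3, 9, 10, 12], num_classes=10)
if bad:
    print("labels out of range for the model:", bad)
```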

A struggle to keep the accuracy

This August, we reached 0.83 evaluation accuracy on the DIB-10K dataset. But since we updated the dataset last month, the accuracy has only reached 0.82.

The first suspect was the Weight Standardization method we used for micro-batch training (since the model is too big). So I tried gradient accumulation instead, using this snippet as an example because it doesn’t require heavy changes to my code:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)
    loss = loss / accumulation_steps
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:
        optimizer.step()
        model.zero_grad()
        if (i+1) % evaluation_steps == 0:
            evaluate_model()
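The division by accumulation_steps in the snippet is what makes the accumulated gradient match the gradient of one large batch. Here is a tiny worked example in plain Python, assuming a one-parameter model y = w * x with a mean-squared-error loss and equally sized micro-batches (all names and numbers are made up for the illustration):

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

# Gradient computed on the full batch of four samples.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient accumulated over two micro-batches of two samples each.
accumulation_steps = 2
micro = len(xs) // accumulation_steps
accumulated_grad = 0.0
for step in range(accumulation_steps):
    lo, hi = step * micro, (step + 1) * micro
    accumulated_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
```

Note that this equivalence only covers the gradient: layers whose output depends on batch statistics, such as BatchNorm, still see just the micro-batch.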

But after changing my code and retraining the model, the accuracy still stays around 0.82:

Epoch     4: reducing learning rate of group 0 to 1.0000e-01.
[2020-12-16 05:53:29] Eval accuracy: 0.8283 | Train accuracy: 0.8187
[2020-12-16 10:01:40] Eval accuracy: 0.8284 | Train accuracy: 0.8938
[2020-12-16 14:11:35] Eval accuracy: 0.8284 | Train accuracy: 0.8313
Epoch     7: reducing learning rate of group 0 to 5.0000e-02.
[2020-12-16 18:21:47] Eval accuracy: 0.8285 | Train accuracy: 0.8750
[2020-12-16 22:31:19] Eval accuracy: 0.8285 | Train accuracy: 0.8313
[2020-12-17 02:41:37] Eval accuracy: 0.8284 | Train accuracy: 0.8625
Epoch    10: reducing learning rate of group 0 to 2.5000e-02.
[2020-12-17 06:52:05] Eval accuracy: 0.8286 | Train accuracy: 0.8500
[2020-12-17 11:02:11] Eval accuracy: 0.8285 | Train accuracy: 0.8063
[2020-12-17 15:12:23] Eval accuracy: 0.8286 | Train accuracy: 0.8375
Epoch    13: reducing learning rate of group 0 to 1.2500e-02.
[2020-12-17 19:22:04] Eval accuracy: 0.8285 | Train accuracy: 0.8313

This makes me really desperate. Maybe I should put this task aside for now and move on to other work.