machine learning

How to get results of YOLOv5

I know that we can call “results.show()” directly to display the image with the detected objects drawn on it. But what if I only want to show the objects whose confidence is above a threshold? Then we need to fetch the results one by one manually:

import torch
import cv2

model = torch.hub.load('.', 'custom', path='best.pt', source='local')
model.eval()

colors = {
    14: (0,255,0),
    80: (0,0,255)
}

names = {
    14: "bird",
    80: "squirrel"
}

for index in ["1.jpeg", "2.jpeg", "4.jpeg", "7.jpeg", "3.webp", "5.webp", "6.webp", "8.png"]:
    img_name = f"squirrel_bird{index}"
    image = cv2.imread(img_name)
    results = model(img_name, size=960)  # run detection on the image file
    for obj in results.pred[0]:
        x1, y1, x2, y2, conf, cat = obj.cpu().numpy()  # .cpu() in case the model runs on the GPU
        x1, y1, x2, y2, cat = int(x1), int(y1), int(x2), int(y2), int(cat)
        print(x1, y1, x2, y2, conf, cat)
        if conf > 0.581 and cat in colors.keys():
            cv2.rectangle(image, (x1, y1), (x2, y2), colors[cat], 2)
            cv2.putText(image, f"{names[cat]},{conf:.2f}", (x1, y1+12), cv2.FONT_HERSHEY_SIMPLEX, 0.5, colors[cat], 1, 2)
    cv2.imshow("yolov5", image)
    cv2.waitKey(0)

The key is to iterate over the tensors in “results.pred[0]”: each one holds the box coordinates, the confidence, and the category of one detected object.
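The filtering step can also be pulled out into a small helper. This is only a sketch: `filter_detections` is a hypothetical name, the sample rows are made up, and it assumes each row has the `[x1, y1, x2, y2, conf, cls]` layout used in the loop above (for example the rows of `results.pred[0].tolist()`).

```python
def filter_detections(rows, threshold, wanted_classes):
    """Keep rows whose confidence exceeds `threshold` and whose class is wanted.

    Each row is assumed to be [x1, y1, x2, y2, conf, cls].
    """
    kept = []
    for x1, y1, x2, y2, conf, cls in rows:
        cls = int(cls)
        if conf > threshold and cls in wanted_classes:
            kept.append((int(x1), int(y1), int(x2), int(y2), float(conf), cls))
    return kept


# Made-up detections, only to show the call
rows = [
    [10.2, 20.7, 110.0, 220.0, 0.91, 14.0],  # confident bird: kept
    [30.0, 40.0, 90.0, 100.0, 0.42, 80.0],   # squirrel below the threshold: dropped
    [5.0, 5.0, 50.0, 50.0, 0.77, 3.0],       # class we do not care about: dropped
]
print(filter_detections(rows, 0.581, {14, 80}))
```

Keeping the filter separate from the drawing code makes it easy to try different thresholds without touching the OpenCV part.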

A trick for using YOLOv5

To detect birds and squirrels, we created a dataset to train the YOLOv5 model. After a week’s training with:

python3 -u train.py --data coco.yaml --cfg yolov5s.yaml --weights '' --batch-size 28 --workers 1

The model could recognize birds and squirrels properly, except in this image:

Why does the model recognize the prominent squirrel on the right as a bird? Even when I tried a bigger model, the result was the same…

Only after digging into the parameters of YOLOv5’s model() call did I find out that we can use a different image size for detection: 960.

import inspect
import torch
import cv2

model = torch.hub.load('.', 'custom', path='last.pt', source='local')
#model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
model.eval()

img = "test.jpg"  # placeholder path: the post does not name the test image file
results = model(img, size=960)
results.show()

The result is below for model(img, size=960)

Hmm, seems the single-stage YOLOv5 model is nearsighted, just like me…

A strange problem in RegNetY-32G

I have been using RegNetY in DongNiao for almost two years. Previously I only used small models such as RegNetY-8G. But after getting a computer with an RTX 3080 Ti, I started to use the biggest one in the original paper: RegNetY-32G.

Training the RegNetY-32G model takes a lot of time, so I use mixed precision in the process. However, with “float16” the training program always crashes with an overflow error:

...
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0237e-320                      
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.012e-320                       
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.06e-321                                                                                                                                 
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.53e-321                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.265e-321                       
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.3e-322                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.16e-322                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6e-322                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5e-324                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0
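The tail of this log is what repeated halving looks like once the scale has dropped into the subnormal range of a 64-bit float: 5e-324 is the smallest positive double, and halving it gives exactly 0.0. A minimal sketch in plain Python (the starting value 2**16 is an assumption for illustration, it is not read from the log):

```python
def halvings_until_zero(scale):
    """Count how many times a float can be halved before it becomes 0.0."""
    steps = 0
    while scale > 0.0:
        scale /= 2.0
        steps += 1
    return steps


print(5e-324 / 2.0)                    # 0.0, like the last line of the log
print(halvings_until_zero(2.0 ** 16))  # halvings needed from the assumed start
```

If every skipped step halves the scale, reaching 0.0 means that a very long run of consecutive steps overflowed, not just a few unlucky batches.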

At first, I suspected that the bigger model couldn’t tolerate a large learning rate (I had used 8.0 for a long time) with “float16” training, so I reduced the learning rate to just 1e-1. The model stopped reporting overflow errors, but the loss wouldn’t converge and stayed at about 9.

Then I had no choice but to adjust the parameters step by step to find a set of hyper-parameters that converges. Finally, I found the reason: enabling the Squeeze-and-Excitation block in RegNetY makes the model harder to converge. The exponential operation in the Sigmoid function might be the cause, since “float16” can’t always handle exponential growth properly.
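To make the suspicion about the exponential concrete, here is a small NumPy sketch. It only illustrates the numeric range of float16 (largest finite value 65504); it is not a reproduction of what happens inside RegNetY, and `naive_sigmoid` is a hypothetical helper written for this example.

```python
import numpy as np


def naive_sigmoid(x):
    """Textbook sigmoid 1 / (1 + exp(-x)), computed in the dtype of `x`."""
    with np.errstate(over="ignore"):       # silence the overflow warning
        return 1.0 / (1.0 + np.exp(-x))


f16_max = float(np.finfo(np.float16).max)  # 65504.0

x16 = np.array([-12.0], dtype=np.float16)
x32 = np.array([-12.0], dtype=np.float32)

with np.errstate(over="ignore"):
    exp16 = np.exp(-x16)                   # exp(12) does not fit into float16

print(f16_max, exp16, naive_sigmoid(x16), naive_sigmoid(x32))
```

In float16 the intermediate exp(12) overflows to inf and the sigmoid collapses to exactly 0, while float32 still returns a small positive value.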

The solution is simple: just disable the Squeeze-and-Excitation block in RegNetY:

    cfg.MODEL.TYPE = "regnet"
    # RegNetY-32.0GF
    cfg.REGNET.DEPTH = 20
    cfg.REGNET.SE_ON = False
    cfg.REGNET.W0 = 232
    cfg.REGNET.WA = 115.89
    cfg.REGNET.WM = 2.53
    cfg.REGNET.GROUP_W = 232
    cfg.BN.NUM_GROUPS = 4
    cfg.MODEL.NUM_CLASSES = config["num_classes"]
    net = model_builder.build_model()

I may try Hard Sigmoid in the Squeeze-and-Excitation block in a future experiment.
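For reference, a plain-Python sketch of the Hard Sigmoid next to the ordinary one. It uses the piecewise-linear form that torch.nn.Hardsigmoid implements (0 below -3, 1 above 3, x/6 + 0.5 in between); whether it actually helps RegNetY converge in float16 is untested here.

```python
import math


def sigmoid(x):
    """Ordinary sigmoid, needs an exponential."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    """Piecewise-linear approximation: no exponential involved."""
    return min(1.0, max(0.0, x / 6.0 + 0.5))


for x in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(x, round(sigmoid(x), 4), hard_sigmoid(x))
```

The hard version only needs an addition, a division and a clamp, so there is no intermediate value that can blow up.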

Strange error from Nvidia’s apex library

apex is a mixed-precision training library from Nvidia. I have been using it since I got an RTX 3080 Ti GPU. A few days ago, I started to use RegNetY-32GF (previously I had only used RegNetY models smaller than 16GF). After an accidental interruption, I tried to resume the training but it reported:

Traceback (most recent call last):
  File "train.py", line 353, in <module>
    train(args, train_loader, eval_loader)
  File "train.py", line 220, in train
    scaled_loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 401, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 191, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([28, 3712, 10, 10], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(3712, 3712, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x55d2a620ff60
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 28, 3712, 10, 10, 
    strideA = 371200, 100, 10, 1, 
output: TensorDescriptor 0x55d2a6215310
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 28, 3712, 10, 10, 
    strideA = 371200, 100, 10, 1, 
weight: FilterDescriptor 0x7fd9e806f1e0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 3712, 3712, 1, 1, 
Pointer addresses: 
    input: 0x7fd73fde3a00
    output: 0x7fd746abb600
    weight: 0x7fd761b5de00

This error looks quite scary, so my first thought was that the training environment had broken. I downloaded the newest GPU driver and pulled the most recent docker container for PyTorch, but the error persisted.

My second thought was that apex couldn’t handle models that big (what was I thinking?), so I modified my code to use “torch.cuda.amp” instead of “apex.amp”, following the documentation. Fortunately, the error disappeared, but I had to use a smaller batch size. It looks like “torch.cuda.amp” doesn’t save as much GPU memory as “apex.amp”.

However, the story doesn’t end here. Just before writing this article, I tried a smaller batch size in my old code with “torch.cuda.amp”, and it worked well…

All in all, the terrible error above is simply caused by insufficient GPU memory.
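The switch from “apex.amp” to “torch.cuda.amp” mentioned above might look roughly like the loop below. It is a sketch with a tiny stand-in model and random data, not the real train.py (which is not shown in this post), and it falls back to plain float32 when no GPU is available.

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# Stand-in model and data, only to make the loop runnable
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

inputs = torch.randn(8, 16, device=device)
targets = torch.randint(0, 4, (8,), device=device)

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_cuda):
        outputs = net(inputs)              # forward pass in mixed precision
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()          # backward pass on the scaled loss
    scaler.step(optimizer)                 # skips the step if gradients overflowed
    scaler.update()                        # adjusts the loss scale for the next step

print(loss.item())
```

Compared with apex, the scaler object takes over the role of amp.scale_loss(), and the model itself is not patched.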

Accelerate inference speed of DNN on Intel CPU

To cut the cost of the inference server, I ran some experiments on how to speed up prediction for our model.

import time

import cv2
import torch
import torch.nn as nn

import pycls.core.builders as model_builder
from pycls.core.config import cfg

def pressure_predict(net, tensor_img):
    t0 = time.time()
    for _ in range(10):
        result = net(tensor_img)
        result = softmax(result)
        values, indices = torch.topk(result, 10)
    t1 = time.time()
    print("time:", t1 - t0)
    print(values)

if __name__ == "__main__":
    cfg.MODEL.TYPE = "regnet"
    # RegNetY-8.0GF
    cfg.REGNET.DEPTH = 17
    cfg.REGNET.SE_ON = False
    cfg.REGNET.W0 = 192
    cfg.REGNET.WA = 76.82
    cfg.REGNET.WM = 2.19
    cfg.REGNET.GROUP_W = 56
    cfg.BN.NUM_GROUPS = 4
    cfg.MODEL.NUM_CLASSES = 11120
    net = model_builder.build_model()
    net.load_state_dict(torch.load("bird_cls_2754696.pth", map_location="cpu"))
    net.eval()
    net = net.float()
    softmax = nn.Softmax(dim=1).eval()

    # read image
    img = cv2.imread("blujay.jpg")
    img = cv2.resize(img, (300, 300))
    tensor_img = torch.from_numpy(img).unsqueeze(0).permute(0, 3, 1, 2).float()
    pressure_predict(net, tensor_img)

    dummy_input = torch.randn(1, 3, 300, 300)
    with torch.jit.optimized_execution(True):
        traced_script_module = torch.jit.trace(net, dummy_input)

    net = torch.jit.optimize_for_inference(traced_script_module)
    pressure_predict(net, tensor_img)

    import intel_extension_for_pytorch as ipex
    net = net.to(memory_format=torch.channels_last)
    net = ipex.optimize(net)
    tensor_img = tensor_img.to(memory_format=torch.channels_last)

    with torch.no_grad():
        pressure_predict(net, tensor_img)

Here is the output on my Intel i5-12400 CPU:

Method	Inference time (seconds per 10 runs)
Directly use model	1.6
After PyTorch’s torch.jit.optimize_for_inference()	1.4
After Intel’s ipex.optimize()	0.8

It looks like Intel has tried hard to optimize neural network models for their CPUs. The only problem is that the intel_extension_for_pytorch package is hard to install (I hit a lot of broken dependencies when trying to install and run it), and the best way to use it is through the docker image intel/intel-optimized-pytorch:latest
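One caveat about numbers like these: the first calls of a freshly traced or optimized model may include one-time optimization work, so a few untimed warm-up calls can make the comparison fairer. A minimal timing helper along those lines (`time_calls` is a hypothetical name, and the lambda is only a stand-in for net(tensor_img)):

```python
import time


def time_calls(fn, arg, repeats=10, warmup=2):
    """Return the wall-clock seconds taken by `repeats` calls of fn(arg)."""
    for _ in range(warmup):              # warm-up calls are not timed
        fn(arg)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return time.perf_counter() - start


elapsed = time_calls(lambda n: sum(i * i for i in range(n)), 100_000)
print(f"{elapsed:.4f} seconds per 10 calls")
```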

Average weights of two Pytorch models

After reading this paper, I began an experiment on it. Referencing this snippet, I wrote my code:

    net1 = model_builder.build_model()
    net2 = model_builder.build_model()
    output = model_builder.build_model()
    net1.load_state_dict(torch.load(args.model1, map_location="cpu"))
    net2.load_state_dict(torch.load(args.model2, map_location="cpu"))
    
    # Average
    sd1 = net1.named_parameters()
    sd2 = net2.named_parameters()
    sdo = dict(sd2)
    for name, param in sd1:
        sdo[name].data.copy_(0.5*param.data + 0.5*sdo[name].data)

    output.load_state_dict(sdo)
    torch.save(output, args.output)
    
    # here is a test
    output.load_state_dict(torch.load(args.output))

But after generating the new model with averaged weights, PyTorch failed to load it:

Traceback (most recent call last):
  File "average_models.py", line 43, in <module>
    output.load_state_dict(torch.load(args.output))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1534, in load_state_dict
    state_dict = state_dict.copy()
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'RegNet' object has no attribute 'copy'

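The reason for the failure is quite simple: torch.save(output, ...) stored the whole model object, so torch.load() returned a RegNet instance, and load_state_dict() failed when it tried to call .copy() on it. We only need to save the state_dict of the model instead of all the information (since I am using FP16 format). Therefore the correct code should be: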

    net1 = model_builder.build_model()
    net2 = model_builder.build_model()
    net1.load_state_dict(torch.load(args.model1, map_location="cpu"))
    net2.load_state_dict(torch.load(args.model2, map_location="cpu"))

    # Average 
    sd1 = net1.named_parameters()
    sd2 = net2.named_parameters()
    sdo = dict(sd2) 
    for name, param in sd1:
        sdo[name].data.copy_(0.5*param.data + 0.5*sdo[name].data)

    torch.save(sdo, args.output)

BTW, in my experiment, averaging my models didn’t raise accuracy as the paper suggests.
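One thing to keep in mind with the code above: named_parameters() does not include buffers such as the running mean and variance of BatchNorm layers, while state_dict() does. A variant that averages the full state dicts could look like the sketch below. The tiny networks are stand-ins for the two RegNetY checkpoints, and `average_state_dicts` is a hypothetical helper, not part of PyTorch or pycls.

```python
import io

import torch
import torch.nn as nn


def average_state_dicts(sd1, sd2):
    """Element-wise mean of two state dicts with identical keys and shapes."""
    averaged = {}
    for name, t1 in sd1.items():
        t2 = sd2[name]
        if t1.is_floating_point():
            averaged[name] = (t1 + t2) / 2
        else:
            # Integer buffers such as BatchNorm's num_batches_tracked are copied as-is
            averaged[name] = t1.clone()
    return averaged


def make_net():
    return nn.Sequential(nn.Conv2d(3, 4, 1), nn.BatchNorm2d(4))


net1, net2, output = make_net(), make_net(), make_net()
output.load_state_dict(average_state_dicts(net1.state_dict(), net2.state_dict()))

# Save only the state_dict (to an in-memory buffer here) and load it back
buffer = io.BytesIO()
torch.save(output.state_dict(), buffer)
buffer.seek(0)
output.load_state_dict(torch.load(buffer))
```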

Some test samples for Text-To-Speech solutions

I have recently been doing some research on TTS (Text-To-Speech) and noticed three solutions that are close to state-of-the-art and work out of the box: LightSpeech (from Microsoft), FastSpeech2 (partly from Microsoft), Nemo (from Nvidia).

The testing text is a paragraph:

The Home Depot, Inc. is the world’s largest home improvement retailer based on net sales for fiscal 2021. We offer our customers a wide assortment of building materials, home improvement products, lawn and garden products, décor products, and facilities maintenance, repair and operations products and provide a number of services, including home improvement installation services and tool and equipment rental. As of the end of fiscal 2021, we operated 2,317 stores located throughout the U.S. (including the Commonwealth of Puerto Rico and the territories of the U.S. Virgin Islands and Guam), Canada, and Mexico. The Home Depot stores average approximately 104,000 square feet of enclosed space, with approximately 24,000 additional square feet of outside garden area. We also maintain a network of distribution and fulfillment centers, as well as a number of e-commerce websites in the U.S., Canada and Mexico. When we refer to “The Home Depot,” the “Company,” “we,” “us” or “our” in this report, we are referring to The Home Depot, Inc. and its consolidated subsidiaries.

The output of FastSpeech2:

It has a lot of noise and sounds somewhat metallic.

The output of LightSpeech:

Sounds a little better, more like a human than a robot.

The output of Nemo:

This is the best result of all three solutions.

This test is just a summary of my research and doesn’t show that one algorithm is better than the others, since the training process heavily affects the final result. But at least Nemo is the closest to a production scenario.

Model saving error when using Apex

Apex is a tool from Nvidia that enables mixed-precision training.

import apex.amp as amp

net, optimizer = amp.initialize(net, optimizer, opt_level="O2")

# forward
outputs = net(inputs)

loss = criterion(outputs, targets)

optimizer.zero_grad()

# float16 backward
with amp.scale_loss(loss, optimizer) as scaled_loss:
  scaled_loss.backward()
  
optimizer.step()

...

torch.save(net, "model.pth")

After I changed my code to use Apex, it reported an error when saving the model by using torch.save(net, "model.pth")

AttributeError: Can't pickle local object '_initialize.<locals>.patch_forward.<locals>.new_fwd'

Someone has already noticed this problem but it seems no one wants to solve it: link. The only solution for this comes from a Chinese blog: link. It recommends just saving model parameters:

torch.save(net.state_dict(), "model.pth")
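The error message hints at why saving only the parameters helps: torch.save() relies on pickle, and pickle cannot serialize a function that was defined inside another function. The sketch below mimics that pattern in plain Python. The classes and names are made up for illustration and are not apex’s actual internals.

```python
import pickle


def patch_forward(obj):
    """Attach a function defined inside another function, like a patched forward."""
    def new_fwd(x):
        return x
    obj.forward = new_fwd
    return obj


class Net:
    def __init__(self):
        self.weights = {"w": [1.0, 2.0]}

    def state_dict(self):
        return dict(self.weights)


net = patch_forward(Net())

try:
    pickle.dumps(net)                    # the whole object drags new_fwd along
    whole_object_ok = True
except (AttributeError, pickle.PicklingError) as err:
    whole_object_ok = False
    print("cannot pickle the whole object:", err)

# Plain data, like a state_dict, pickles without trouble
restored = pickle.loads(pickle.dumps(net.state_dict()))
print(restored)
```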

Finding problem about ‘Nan’ result in model training

I intend to use distillation to train my model. The plan is:

1. Train model A and model B with the same code and the same dataset
2. Predict the dataset with model A and model B, and store the average of their results
3. Use the average prediction as the target of a new training process

Step 1 and Step 2 were successful. But when I ran the new training process, it reported the loss as “Nan” after some steps.

To find out the reason, I started to print all the “average prediction results” for every step. At first they looked normal, but after a while I found that some of them contained “Nan”.

Why is there “Nan” in the “average prediction results”? My guess is that some samples are so rare (or special) that the model gives an unreliable output. It’s quite easy to just ignore them:

if np.isnan(label).any() or not np.isfinite(label).all():
  # Drop the corresponding sample
  return None

Now the distillation training can go on.
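Put together with step 2 of the plan, the check could sit right where the two predictions are averaged. A small NumPy sketch (`average_soft_label` is a hypothetical helper and the vectors are made up):

```python
import numpy as np


def average_soft_label(pred_a, pred_b):
    """Average two prediction vectors; return None if the result is not finite."""
    a = np.asarray(pred_a, dtype=np.float32)
    b = np.asarray(pred_b, dtype=np.float32)
    label = (a + b) / 2
    if not np.isfinite(label).all():     # catches NaN as well as +/-inf
        return None
    return label


good = average_soft_label([0.2, 0.8], [0.4, 0.6])
bad = average_soft_label([0.2, 0.8], [float("nan"), 0.6])
print(good, bad)
```

np.isfinite() is False for NaN too, so a single check covers both conditions of the snippet above.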

My summary for the paper “Unified Language Model Pre-training for Natural Language Understanding and Generation”

For NLU (Natural Language Understanding), we use a bidirectional language model (like BERT), but for NLG (Natural Language Generation), a left-to-right unidirectional language model (like GPT) is the only choice.

Could we accomplish these two tasks by using one unified language model?

In this paper, the authors use a mask matrix to run different tasks in the same model:

The pivotal equation for this method is:
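Restated in the paper’s notation, with $H^{l-1}$ the output of the previous layer, $W_l^Q, W_l^K, W_l^V$ the projection matrices of layer $l$, and $d_k$ the key dimension, the masked self-attention is:

```latex
Q = H^{l-1} W_l^{Q}, \qquad K = H^{l-1} W_l^{K}, \qquad V = H^{l-1} W_l^{V}

M_{ij} =
\begin{cases}
0,       & \text{token } i \text{ is allowed to attend to token } j \\
-\infty, & \text{token } i \text{ is prevented from attending to token } j
\end{cases}

A_l = \operatorname{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + M \right) V
```

Adding $-\infty$ before the softmax drives the corresponding attention weight to zero, which is how one network can serve several language-modeling objectives.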

“M is the mask matrix and determines whether a pair of tokens can be attended to each other.”

“Unidirectional LM is done by using a triangular matrix for the self-attention mask M (as in the above equation), where the upper triangular part of the self-attention mask is set to −∞, and the other elements to 0”
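A small NumPy sketch of the triangular mask described in this quote. The uniform scores are made up so that the effect of the mask is easy to see, and the function names are hypothetical.

```python
import numpy as np


def unidirectional_mask(n):
    """Left-to-right mask: 0 on and below the diagonal, -inf strictly above it."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask


def masked_attention_weights(scores, mask):
    """Row-wise softmax of scores + mask; positions masked with -inf get weight 0."""
    z = scores + mask
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


scores = np.zeros((4, 4))                    # made-up, uniform attention scores
weights = masked_attention_weights(scores, unidirectional_mask(4))
print(weights)
```

Each row of the result only spreads its weight over the current and earlier positions, so token i never sees tokens to its right.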

“Within one training batch, 1/3 of the time we use the bidirectional LM objective, 1/3 of the time we employ the sequence-to-sequence LM objective, and both left-to-right and right-to-left LM objectives are sampled with the rate of 1/6”
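The mixing rates in this quote can be pictured as drawing an objective at random with those probabilities. How the authors actually schedule the objectives inside a batch is not spelled out here, so the sketch below is only one possible reading, with hypothetical names.

```python
import random
from collections import Counter

OBJECTIVES = ["bidirectional", "seq2seq", "left-to-right", "right-to-left"]
RATES = [1 / 3, 1 / 3, 1 / 6, 1 / 6]


def sample_objective(rng):
    """Draw one LM objective according to the rates quoted from the paper."""
    return rng.choices(OBJECTIVES, weights=RATES, k=1)[0]


rng = random.Random(0)
counts = Counter(sample_objective(rng) for _ in range(6000))
print(counts)
```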

Keep in mind that what the training process varies is the objective (bidirectional/unidirectional/seq2seq), not the samples.