My company has been using Argo to execute workflows for more than three years. I knew that every step in an Argo workflow can be controlled by a "when" expression, like this:
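(A sketch following the coin-flip example from Argo's documentation; the step and template names are placeholders, not our production workflows.)

  steps:
  - - name: flip-coin
      template: flip-coin
  - - name: heads
      template: heads
      when: "{{steps.flip-coin.outputs.result}} == heads"

The "heads" step runs only when the output of the "flip-coin" step equals "heads".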
When I logged in to my computer and tried to run "tmux attach" this morning, it reported a strange error:
/tmp/tmux-1001/default (Address already in use)
Intuitively, I thought this temporary file was stale, so I typed a command to delete it. But another error jumped out: "The filesystem is read-only!"
By looking at the mount points with "mount | grep ro", I noticed that my root directory was mounted read-only. After checking /etc/fstab, I guessed that one of the mount options was wrong, so the operating system mounted the root filesystem as read-only.
After removing the options one by one and rebooting the machine many times, it turned out that "data=writeback" was the incorrect option. Essentially, the "data=writeback" option is only for ext3.
When I tried to modify /etc/fstab, the system reported "you can't change the file because the root filesystem is read-only". It seemed I was trapped in a vicious circle, so I used my final weapon:
sudo mount -o remount,rw /dev/nvme0n1p2 /
And it worked.
Now, after fixing /etc/fstab, the ext4 filesystem can be mounted with both read and write permissions:
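(A sketch of the corrected entry; the remaining mount options are assumptions, the point is only that "data=writeback" is gone.)

/dev/nvme0n1p2  /  ext4  defaults  0  1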
To save costs on the inference server, I did some experiments on how to accelerate prediction for our model.
import time

import cv2
import torch
import torch.nn as nn
import pycls.core.builders as model_builder
from pycls.core.config import cfg


def pressure_predict(net, tensor_img):
    # Run the model 10 times and report the total latency
    t0 = time.time()
    for _ in range(10):
        result = net(tensor_img)
        result = softmax(result)
        values, indices = torch.topk(result, 10)
    t1 = time.time()
    print("time:", t1 - t0)
    print(values)


if __name__ == "__main__":
    # Build a RegNetY-8.0GF model with pycls
    cfg.MODEL.TYPE = "regnet"
    cfg.REGNET.DEPTH = 17
    cfg.REGNET.SE_ON = False
    cfg.REGNET.W0 = 192
    cfg.REGNET.WA = 76.82
    cfg.REGNET.WM = 2.19
    cfg.REGNET.GROUP_W = 56
    cfg.BN.NUM_GROUPS = 4
    cfg.MODEL.NUM_CLASSES = 11120
    net = model_builder.build_model()
    net.load_state_dict(torch.load("bird_cls_2754696.pth", map_location="cpu"))
    net.eval()
    net = net.float()
    softmax = nn.Softmax(dim=1).eval()

    # Read the image and convert HWC uint8 to an NCHW float tensor
    img = cv2.imread("blujay.jpg")
    img = cv2.resize(img, (300, 300))
    tensor_img = torch.from_numpy(img).unsqueeze(0).permute(0, 3, 1, 2).float()

    # Baseline: use the model directly
    pressure_predict(net, tensor_img)

    # TorchScript: trace the model and optimize it for inference
    dummy_input = torch.randn(1, 3, 300, 300)
    with torch.jit.optimized_execution(True):
        traced_script_module = torch.jit.trace(net, dummy_input)
        net = torch.jit.optimize_for_inference(traced_script_module)
    pressure_predict(net, tensor_img)

    # Intel Extension for PyTorch: channels-last memory format plus ipex.optimize()
    import intel_extension_for_pytorch as ipex
    net = net.to(memory_format=torch.channels_last)
    net = ipex.optimize(net)
    tensor_img = tensor_img.to(memory_format=torch.channels_last)
    with torch.no_grad():
        pressure_predict(net, tensor_img)
Here is the output on my Intel i5-12400 CPU:
Method                                               Inference time (seconds per 10 runs)
Directly use the model                               1.6
After PyTorch's torch.jit.optimize_for_inference()   1.4
After Intel's ipex.optimize()                        0.8
It looks like Intel tried hard to optimize their CPUs for neural network models. The only problem is that the intel_extension_for_pytorch package is hard to install (I hit a lot of broken dependencies when trying to install and run it), so the best way to use it is through the docker image intel/intel-optimized-pytorch:latest.
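For example, to get a shell with the extension pre-installed (the exact command line here is my own sketch; only the image name comes from Intel):

docker run -it --rm -v $(pwd):/workspace intel/intel-optimized-pytorch:latest bash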
After reading this paper, I began to experiment with it. Referencing this snippet, I wrote my code:
import argparse

import torch
import pycls.core.builders as model_builder

# Argument parsing added so the snippet runs standalone; the three
# argument names follow the ones used below
parser = argparse.ArgumentParser()
parser.add_argument("--model1")
parser.add_argument("--model2")
parser.add_argument("--output")
args = parser.parse_args()

net1 = model_builder.build_model()
net2 = model_builder.build_model()
output = model_builder.build_model()
net1.load_state_dict(torch.load(args.model1, map_location="cpu"))
net2.load_state_dict(torch.load(args.model2, map_location="cpu"))

# Average the weights of the two models, parameter by parameter
sd1 = net1.named_parameters()
sd2 = net2.named_parameters()
sdo = dict(sd2)
for name, param in sd1:
    sdo[name].data.copy_(0.5 * param.data + 0.5 * sdo[name].data)
output.load_state_dict(sdo)
torch.save(output, args.output)

# here is a test
output.load_state_dict(torch.load(args.output))
But after generating the new weight-averaged model, PyTorch failed to load it:
Traceback (most recent call last):
  File "average_models.py", line 43, in <module>
    output.load_state_dict(torch.load(args.output))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1534, in load_state_dict
    state_dict = state_dict.copy()
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'RegNet' object has no attribute 'copy'
The reason for the failure is quite simple: we only need to save the model's state_dict instead of the whole model object (since I am using the FP16 format). Therefore the correct code should be:
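torch.save(output.state_dict(), args.output)

# here is a test
output.load_state_dict(torch.load(args.output))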
I have used awk for quite a long time, but not its sibling sed. A couple of days ago I wanted to insert two lines into a CMake file at a specific position, and found a perfect answer: here.
Now I can insert the two lines before the line beginning with "enable_language" by using:
sed -i '/^enable_language/i set(CMAKE_CUDA_ARCHITECTURES 86)\nset(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)' \
cmake/public/cuda.cmake
I have been using Nvidia's official PyTorch docker image for my model training for quite a long time. It works very well, but the only problem is that the image is too large: more than 6GB. On my poor home network, it takes a painfully long time to download.
Yesterday, an interesting idea jumped into my mind: why not build my own small docker image for PyTorch? So I started to do it.
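Here is a minimal sketch of the idea, assuming a CPU-only image (the base image and the wheel index are my choices for illustration, not the exact image I ended up building):

FROM python:3.10-slim
# Install a CPU-only PyTorch wheel to keep the image small
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu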
I needed to download some large files from Google Drive onto my server ("server" means no GUI). After a quick search, I found a solution: https://stackoverflow.com/a/50670037/5048046
We can install it with pip:
python3 -m pip install gdown
Then just give the URL of the Google Drive file to it:
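For example (the file ID below is a placeholder):

gdown "https://drive.google.com/uc?id=<file_id>"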
I have been doing some research on TTS (Text-To-Speech) recently and noticed three nearly state-of-the-art, out-of-the-box solutions: LightSpeech (from Microsoft), FastSpeech2 (partly from Microsoft), and Nemo (from Nvidia).
The output of FastSpeech2 has a lot of noise and sounds metallic.
The output of LightSpeech:
The output of Nemo:
This test is just a summary of my research and is not meant to show which algorithm is better than the others, since the training process heavily affects the final result. But at least Nemo is the closest one to a production scenario.