RobinDong

Training CIFAR-100 by DeepSpeed

To let DeepSpeed tolerate the failure of one training node, we can launch it with elastic training:

deepspeed \
  --master_addr=rogpt1 \
  --elastic_training \
  --min_elastic_nodes=1 \
  --max_elastic_nodes=2 \
  --hostfile=hostfile \
  train.py \
  --deepspeed_config ds_config.json

But if one training node fails and later we want to relaunch it, it will fail to relaunch because it doesn’t have the checkpoint in the local directory. To solve this, there are two solutions:

Using a shared file system (Filestore of GCP, EFS of AWS, or just NFS) for the cluster and only letting the master node save the checkpoint. The saved checkpoint will be seen by all other nodes through the shared file system.
Or simply set “use_node_local_storage” to true in the DeepSpeed config, so that every node saves its own checkpoints:

{
   "steps_per_print": 2000,
   "checkpoint": {
     "use_node_local_storage": true
   },
   "elasticity": {
     "enabled": true,
     "micro_batch_sizes": [64,128,256],
     "max_train_batch_size": 1024
   },
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.001,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },
   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 0.001,
       "warmup_num_steps": 1000
     }
   },
   "wall_clock_breakdown": false
}
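With elasticity enabled, the global batch size has to stay valid when the number of GPUs changes. The sketch below is not DeepSpeed's own algorithm. It only checks the basic constraint "train batch size = micro batch size × gradient accumulation steps × number of GPUs" against the values from the config above. The function name and the choice of 1024 as the train batch size are illustrative assumptions.

```python
def valid_micro_batches(train_batch_size, num_gpus, micro_batch_sizes):
    """Return {micro_batch: grad_accum_steps} for every micro batch size that fits.

    A micro batch size fits when
        train_batch_size == micro_batch * grad_accum_steps * num_gpus
    for some whole number of gradient accumulation steps.
    """
    fits = {}
    for micro in micro_batch_sizes:
        per_step = micro * num_gpus
        if train_batch_size % per_step == 0:
            fits[micro] = train_batch_size // per_step
    return fits


micro_batch_sizes = [64, 128, 256]   # from "elasticity.micro_batch_sizes"
train_batch_size = 1024              # assumed: equal to "max_train_batch_size"

for num_gpus in (2, 4, 3):
    print(num_gpus, valid_micro_batches(train_batch_size, num_gpus, micro_batch_sizes))
```

With 2 or 4 GPUs all three micro batch sizes fit, while with 3 GPUs none of them does. This is the kind of compatibility an elastic setup has to respect when nodes come and go.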

Distributed Data-Parallel training of PyTorch

Let’s get to the point directly:

import os
import time

import torch
import torch.nn as nn
import torch.distributed as dist

from model import resnet152
from dataset import get_data_loaders
from torch.nn.parallel import DistributedDataParallel as DDP

learning_rate = 0.001
num_epochs = 40
momentum = 0.9
weight_decay = 1e-5


def setup():
    # initialize the process group
    dist.init_process_group("nccl")


def cleanup():
    dist.destroy_process_group()


def train(rank, world_size):
    setup()

    model = resnet152().to(rank)
    model = DDP(model)

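    # Load on every rank, not only rank 0: DDP syncs the weights only when it
    # is constructed, and all ranks must agree on the epoch to resume from.
    if os.path.exists("last.pth"):
        obj = torch.load("last.pth", map_location=f"cuda:{rank}")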
        print(f"Rank{rank} load 'last.pth' with epoch: {obj['epoch']}")
        model.load_state_dict(obj["model"])
        begin = obj["epoch"] + 1
    else:
        begin = 0
    print(f"Rank{rank} begin at {begin}")

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    start = time.time()
    running_loss = 0
    trainloader, testloader = get_data_loaders(rank, world_size)

    for epoch in range(begin, num_epochs):
        trainloader.sampler.set_epoch(epoch)
        for index, (images, labels) in enumerate(trainloader):
            # gpu
            images, labels = images.to(rank), labels.to(rank)

            outputs = model(images)

            loss = criterion(outputs, labels)

            # backward and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        # train
        correct = 0
        total = 0
        with torch.no_grad():
            for data in trainloader:
                images, labels = data

                # gpu
                images, labels = images.to(rank), labels.to(rank)

                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        trainset_accu = 100 * correct / total

        # test
        correct = 0
        total = 0
        with torch.no_grad():
            for data in testloader:
                images, labels = data
                # gpu
                images, labels = images.to(rank), labels.to(rank)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        testset_accu = 100 * correct / total
        if rank == 0:
            print(
                f"[{epoch}] Accu: {trainset_accu:.2f}%, {testset_accu:.2f}% "
                f"| {(time.time() - start)/60.0:.1f} mins, loss: {running_loss}"
            )
            torch.save(model.state_dict(), f"cifar100_{epoch}.pth")
            torch.save({"model": model.state_dict(), "epoch": epoch}, "last.pth")
        running_loss = 0.0

    end = time.time()
    stopWatch = end - start
    print("Training is done")
    print("Total Training Time (second):", stopWatch)
    cleanup()


if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    train(local_rank, world_size)

The main training code comes from this notebook (many thanks to @batuhan3526), and the code for the distributed environment is from here. I haven’t pasted the code for the dataset since this doc already gives a sufficient introduction.

To run this snippet on two nodes (each with two GPUs), I need to use the powerful “torchrun”:

torchrun \
  --rdzv-backend=c10d \
  --rdzv-endpoint=rogpt1:23456 \
  --nnodes=1:2 \
  --max-restarts=3 \
  --nproc-per-node=2 \
  train.py

With the above snippet, the local rank-0 process on each node saves the checkpoint. If one process fails, the whole cluster restarts and resumes training from epoch + 1.

I tried letting only the rank-0 process on node-0 save the checkpoint. However, since the other nodes had no checkpoint to load, the restart got stuck in an endless loop.
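Since every restart depends on “last.pth”, it is worth guarding against a process being killed in the middle of writing it, which would leave a truncated file. A common pattern is to write to a temporary file and rename it into place. The sketch below uses JSON instead of torch.save so that it runs anywhere; the helper names and the JSON payload are illustrative assumptions, not part of the training script above.

```python
import json
import os
import tempfile


def save_atomically(state, path):
    """Write `state` to `path` so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same file system for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomically swap in the new checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def resume_epoch(path):
    """Return the epoch to start from: last saved epoch + 1, or 0 if none."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["epoch"] + 1
```

In the training loop the same idea would mean calling torch.save on a temporary path and then os.replace to “last.pth”.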

The Pub/Sub subscription problem

We have a project using Pub/Sub of Google Cloud. About one month ago, the pipeline failed because the subscription inexplicably disappeared. I suspected someone might have deleted it by mistake. However, after searching through a large amount of GCP logs, I found nothing. Without any clue, I just re-created the subscription.

Then came the new year. The subscription disappeared again. This time, it looked like a system setting instead of a human mistake. Fortunately, we found this doc from GCP: https://cloud.google.com/knowledge/kb/pub-sub-subscriptions-disappeared-without-any-deletion-logs-000004170. When creating a subscription, the default “expiring” setting is just 31 days. This means the subscription will be deleted automatically if it sees no subscriber activity (such as pulling messages) for 31 days.

For safety, we’d better create subscriptions with a longer “expiring”, or just “Never expire”.
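For reference, the gcloud CLI accepts “--expiration-period=never” when creating or updating a subscription. The small sketch below only builds the command line and leaves running it to you; the function name and the subscription and topic names are placeholders.

```python
import shlex


def never_expire_command(subscription, topic=None):
    """Build a gcloud command that sets a subscription to never expire.

    With `topic`, the command creates a new subscription; without it,
    the command updates an existing one.
    """
    if topic is not None:
        args = ["gcloud", "pubsub", "subscriptions", "create", subscription,
                f"--topic={topic}"]
    else:
        args = ["gcloud", "pubsub", "subscriptions", "update", subscription]
    args.append("--expiration-period=never")
    return args


# "my-sub" and "my-topic" are placeholder names.
print(shlex.join(never_expire_command("my-sub", topic="my-topic")))
print(shlex.join(never_expire_command("my-sub")))
```

The returned list can be passed to subprocess.run(args, check=True) on a machine where gcloud is installed and authenticated.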

Test of SegmentAnything Model (SAM)

Here is the original picture:

Months ago, I tested the segmentation of YOLOv8. The result was not very promising:

The tail of one monkey couldn’t be segmented correctly.

Today I tested the same picture with Meta’s Segment Anything Model (SAM). Using the simple Colab notebook and the “vit_l” model type, the result looks better:

At least the yellow monkey’s tail has been segmented correctly in full.

Note: To run the notebook on a T4 GPU, you may need to set points_per_batch like:

SamAutomaticMaskGenerator(sam, points_per_batch=16)

How about running with another picture? The goats:

Now, even SAM couldn’t segment all the horns of the goats correctly.

Instance Segmentation by YOLOv8

If we want to use YOLOv8 for object detection, here is a good example.

What if I want to use YOLOv8 to segment a picture, crop out the object, and paste (only paste the object, not the pixels near it) it to a new picture? I wrote an example:

import os
import sys

import cv2
import numpy as np

from ultralytics import YOLO

IMG_SIZE = 2048

def main(path):
    filename, file_extension = os.path.splitext(path)

    # Load segment model of yolov8
    model = YOLO("yolov8x-seg.pt")
    img = cv2.imread(path)

    results = model(img, imgsz=(img.shape[1], img.shape[0]))

    count = 0
    for res in results:
        if res.masks is None:  # nothing was segmented in this image
            continue
        for mask in res.masks.xy:
            polygon = mask.reshape((-1, 1, 2)).astype(np.int32)
            x, y, w, h = cv2.boundingRect(polygon)

            # Create a mask with all values set to 255
            binary_mask = np.ones(img.shape, dtype=np.uint8) * 255
            # Fill the polygon (the object we want) with zeros
            cv2.fillPoly(binary_mask, [polygon], (0, 0, 0))
            # cv2.add saturates at 255: adding the mask keeps the object pixels
            # unchanged and pushes the background to white (255)
            out_img = cv2.add(img, binary_mask)[y:y+h, x:x+w]

            cv2.imwrite(f"{filename}_{count}{file_extension}", out_img)
            count += 1
    print(f"Total: {count}")

if __name__ == "__main__":
    main(sys.argv[1])
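The last step of the loop relies on cv2.add being a saturating addition: results above 255 are clipped to 255 rather than wrapping around as plain uint8 arithmetic in NumPy would. Here is a tiny pure-Python illustration on one row of grayscale pixels (the pixel values and function names are made up):

```python
def saturating_add(pixels, mask):
    """Add two rows of 8-bit values, clipping at 255 like cv2.add does."""
    return [min(p + m, 255) for p, m in zip(pixels, mask)]


def wrapping_add(pixels, mask):
    """Add two rows of 8-bit values modulo 256, like uint8 `+` in NumPy."""
    return [(p + m) % 256 for p, m in zip(pixels, mask)]


row = [12, 200, 90, 33]      # made-up pixel values
mask = [255, 0, 0, 255]      # 0 inside the polygon, 255 outside

print(saturating_add(row, mask))  # [255, 200, 90, 255]
print(wrapping_add(row, mask))    # [11, 200, 90, 32]
```

With the saturating version the two pixels inside the polygon survive and the background turns white. With the wrapping version the background would turn into dark noise instead.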

Timezone in pods of Argo

Last week I noticed that pods in Argo report the UTC timezone even though the Argo configuration sets an AEDT timezone.

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: my-cron-workflow
  namespace: MYNAMESPACE
spec:
  schedule: "0 10 * * *"
  timezone: "Australia/Sydney"

In fact, the timezone in this configuration only determines when the scheduled job runs. It does not set the timezone inside the pods.

To set up a unified timezone for all the pods, we need to mount the zoneinfo file from the host as a volume:
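To see the difference concretely: the schedule "0 10 * * *" in Australia/Sydney fires at 10:00 local time, but a pod whose clock is in UTC prints a different hour and even a different date. The sketch below uses a fixed UTC+11 offset for AEDT (Sydney during daylight saving; outside it Sydney is UTC+10), and the date is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone

AEDT = timezone(timedelta(hours=11), "AEDT")  # Sydney during daylight saving

# Arbitrary example date; 10:00 matches the cron schedule "0 10 * * *".
scheduled = datetime(2024, 1, 15, 10, 0, tzinfo=AEDT)
seen_in_pod = scheduled.astimezone(timezone.utc)

print(scheduled.isoformat())    # 2024-01-15T10:00:00+11:00
print(seen_in_pod.isoformat())  # 2024-01-14T23:00:00+00:00
```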

apiVersion: v1
kind: Pod
metadata:
  name: busybox-sleep
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    volumeMounts:
    - name: tz-config
      mountPath: /etc/localtime
  volumes:
    - name: tz-config
      hostPath:
        path: /usr/share/zoneinfo/Australia/Sydney
        type: File

Install new driver for old Nvidia Tesla P100

I was trying to launch a VM instance with a GPU on Google Cloud. But after trying T4, L4, and V100, they all reported “exceeding resource limit”, which probably means a lot of people in my region are using these types of GPUs.

With no other choice, I launched a VM instance with an old Nvidia Tesla P100 (I first used it about 5 years ago). Then I needed to install its driver. But the installation process reported errors:

   *** Failed CC version check. ***

     SYMLINK /tmp/selfgz26389/NVIDIA-Linux-x86_64-515.105.01/kernel/nvidia/nv-kernel.o
     SYMLINK /tmp/selfgz26389/NVIDIA-Linux-x86_64-515.105.01/kernel/nvidia-modeset/nv-modeset-kernel.o
    CONFTEST: hash__remap_4k_pfn
    CONFTEST: set_pages_uc
    CONFTEST: list_is_first
    CONFTEST: set_memory_uc
...
	cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'

At first glance, I suspected the GCC version was the problem. But after downgrading GCC to gcc-10 and gcc-9, the error still existed.

Finally, I noticed that the driver of the Tesla P100 is very new (Release Date: 2023.3.30) and this page mentioned “gcc-12”. The option “-ftrivial-auto-var-init” was only added in GCC 12, so older compilers reject it. Therefore I upgraded GCC to 12:

sudo apt install gcc-12
sudo ln -sf /usr/bin/gcc-12 /etc/alternatives/cc

Now the driver can be installed successfully.

The experience of using Google Cloud’s Text-to-Speech AI

I used the Python API of Text-to-Speech AI to convert a PDF file to MP3 audio, as in this example:

from google.cloud import texttospeech
from PyPDF2 import PdfReader

client = texttospeech.TextToSpeechClient()

reader = PdfReader("xxx.pdf")

voice = texttospeech.VoiceSelectionParams(
    language_code="cmn-CN", name="cmn-CN-Wavenet-B", ssml_gender=texttospeech.SsmlVoiceGender.MALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=0.8,
)

text = ""
index = 1
# try first 10 pages
for page in reader.pages[:10]:
    text += page.extract_text()

print(len(text))
synthesis_input = texttospeech.SynthesisInput(text=text)
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("output.mp3", "wb") as out:
    out.write(response.audio_content)
    print("Written")

Very simple, right? But it just reported an error:

google.api_core.exceptions.InvalidArgument: 400 Either `input.text` or `input.ssml` is longer than the limit of 5000 bytes. This limit is different from quotas. To fix, reduce the byte length of the characters in this request, or consider using the Long Audio API: https://cloud.google.com/text-to-speech/docs/create-audio-text-long-audio-synthesis.

It seems the request is too long. Let’s use the “Long Audio API”:

from google.cloud import texttospeech
from PyPDF2 import PdfReader

client = texttospeech.TextToSpeechLongAudioSynthesizeClient()

reader = PdfReader("xxx.pdf")

voice = texttospeech.VoiceSelectionParams(
    language_code="cmn-CN", name="cmn-CN-Wavenet-B", ssml_gender=texttospeech.SsmlVoiceGender.MALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    speaking_rate=0.8,
)

text = ""
index = 1
for page in reader.pages[:10]:
    text += page.extract_text()

print(len(text))
synthesis_input = texttospeech.SynthesisInput(text=text)
request = texttospeech.SynthesizeLongAudioRequest(
    parent="projects/robin-00000/locations/us",
    input=synthesis_input, voice=voice, audio_config=audio_config,
    output_gcs_uri="gs://robin_tts/xxx.mp3"
)

operation = client.synthesize_long_audio(request=request)
result = operation.result(timeout=300)
print(result)

It still didn’t work:

google.api_core.exceptions.InvalidArgument: 400 The long audio API does not support the language zh. Supported languages: en, es.

Okay. It doesn’t support the Chinese language. Then, what should I do if I want to convert a Chinese PDF to MP3? Convert it page by page into 500 MP3 files? This is terrible. Even the short MP3 it generated definitely sounds like a machine, not a human.
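Note that the error message counts 5000 bytes, not characters, and a Chinese character usually takes 3 bytes in UTF-8. Instead of splitting by page, the text could be packed into chunks just under the limit, with each chunk sent as its own request. A minimal sketch follows; the function name is an assumption, and it cuts at arbitrary character positions, whereas cutting at sentence ends would likely sound better.

```python
def split_by_bytes(text, limit=5000):
    """Split text into pieces whose UTF-8 encoding is at most `limit` bytes."""
    if limit < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("limit must be at least 4 bytes")
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if size + n > limit:
            # The current chunk is full: close it and start a new one.
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


sample = "你好" * 4000  # 8000 characters, 24000 bytes in UTF-8
pieces = split_by_bytes(sample)
print(len(pieces), [len(p.encode("utf-8")) for p in pieces])
```

Each chunk can then go through the same client.synthesize_speech call as in the first example. How best to join the resulting audio segments is left open here.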

Google has the state-of-the-art technology of deep learning but some of their products in the cloud are ridiculously hard to use (such as Vertex AI, and this Text-to-Speech).

After some searching (at least Google search is as perfect as before), I found NaturalReader. Surprisingly, it supports the Chinese language and the voice sounds as good as a real human. The only problem is that it is very expensive for individual users.

Using Javascript to load ONNX model for Object Detection job

Although I have used the YOLOv5 model several times, I hadn’t used its corresponding ONNX model before. This time, I met a use case that needs to run the ONNX model using Javascript.

To learn and debug the code, I installed node.js and started my Javascript trip. This snippet helped me a lot to understand the Non-Max Suppression algorithm again (I have used the NMS algorithm for many years). After studying it and also the implementation in YOLOv5, I finally knocked together the Javascript code that uses the ONNX model of YOLOv5 to detect objects:

https://github.com/RobinDong/onnx_js/blob/main/detect.js
Why is “YOLOV5S_CLASSES” 85? That’s because the YOLO series uses the first four floats as the coordinates of a box (center x, center y, width, height), the fifth as the “object confidence”, and the remaining 80 floats as the “class confidence”.

For anyone who is interested in it, you can follow the steps below to run it:
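To make the 85-float layout and the NMS step concrete, here is a small pure-Python sketch. It is not a port of detect.js: the function names, the 0.45 IoU threshold, and the sample numbers are illustrative assumptions. The final score is taken as object confidence times class confidence, as YOLOv5 does.

```python
def decode_row(row, num_classes=80):
    """Split one 85-float prediction into box corners, score, and class id."""
    assert len(row) == 5 + num_classes
    cx, cy, w, h, objectness = row[:5]
    class_scores = row[5:]
    class_id = max(range(num_classes), key=lambda i: class_scores[i])
    score = objectness * class_scores[class_id]
    # Convert (center x, center y, width, height) to (x1, y1, x2, y2).
    box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    return box, score, class_id


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.45):
    """Greedy NMS over (box, score, class_id) tuples, highest score first."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a detection unless it overlaps a kept one of the same class.
        if all(det[2] != k[2] or iou(det[0], k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


row = [100, 100, 40, 20, 0.9] + [0.0] * 80
row[5 + 16] = 0.8  # made-up confidence for class 16
print(decode_row(row))
```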

Install node.js on your computer or laptop
Run npm install fs inkjet onnxruntime-node to install js packages
Run node detect.js lorikeet.jpg to detect objects in the image “lorikeet.jpg” and output to file “output.txt”
Run python3 draw.py lorikeet.jpg to show the image of the result by using data in “output.txt”

Hope it could be useful for someone.

Accelerate augmentation of bird audio

audiomentations is a very convenient library for my bird sound classification, as the code below shows:

from audiomentations import Compose, AddGaussianNoise, AddGaussianSNR, TimeStretch, PitchShift

        self.augment = Compose([
            AddGaussianNoise(min_amplitude=0.005, max_amplitude=0.015, p=poss),
            AddGaussianSNR(min_snr_in_db=5.0, max_snr_in_db=40.0, p=poss),
            TimeStretch(min_rate=0.8, max_rate=1.2, p=poss),
            PitchShift(min_semitones=-2, max_semitones=2, p=poss)
        ])

These four augmentation methods are enough for the current training. But the PitchShift method costs a lot of CPU resources, so the CPU usage jumps to 100% and the GPU cannot run at full load.

Having failed to find an audio augmentation library that uses the GPU, I started to check the source code of “audiomentations” and noticed that it uses librosa for its implementation:

        try:
            pitch_shifted_samples = librosa.effects.pitch_shift(
                samples, sr=sample_rate, n_steps=self.parameters["num_semitones"]
            )
        except librosa.util.exceptions.ParameterError:

Then the code of “librosa” for “pitch_shift”:

def pitch_shift(
    y: np.ndarray,
    *,
    sr: float,
    n_steps: float,
    bins_per_octave: int = 12,
    res_type: str = "soxr_hq",
    scale: bool = False,
    **kwargs: Any,
) -> np.ndarray:

The default “res_type” for “pitch_shift” is “soxr_hq”. This looked like a slow resampler (but see the correction below). After changing “res_type” to “linear” in “audiomentations”, the CPU usage dropped back to 50% on my desktop and the GPU ramped up to 100% when training.

—— 2023.07.28 ——

Thanks for the correction from Iver.

I ran this test snippet:

import time
import librosa

sound, sr = librosa.load("./song/background/AirportAnnouncements_1.wav")

for resource in [None, "linear", "soxr_hq", "kaiser_best"]:
    begin = time.time()
    for _ in range(10):
        if resource:
            librosa.effects.pitch_shift(sound, sr=sr, n_steps=1, res_type=resource)
        else:
            librosa.effects.pitch_shift(sound, sr=sr, n_steps=1)
    if resource:
        print(f"{resource} time:", time.time() - begin)
    else:
        print("default time:", time.time() - begin)

and got this result:

default time: 8.455572366714478
linear time: 3.3037502765655518
soxr_hq time: 3.3474862575531006
kaiser_best time: 8.467342615127563

Iver is right: the soxr_hq is as fast as linear. And the actual default res_type of librosa which I was using is kaiser_best.
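If editing the library source is not appealing, one possible alternative is to rebind the function with a different default through functools.partial before the augmentation pipeline is built. The sketch below shows the idea on a stand-in function with the same keyword-only style of signature. The stand-in and its return value are purely illustrative, and whether rebinding librosa.effects.pitch_shift this way is picked up by a given audiomentations version is an assumption that needs checking.

```python
import functools


def pitch_shift(y, *, sr, n_steps, res_type="kaiser_best"):
    """Stand-in for a pitch-shift function; reports which resampler it would use."""
    return f"shift {n_steps} semitones at {sr} Hz with {res_type}"


# Same callable, but with a cheaper default resampler.
fast_pitch_shift = functools.partial(pitch_shift, res_type="linear")

print(pitch_shift([0.0], sr=22050, n_steps=1))
print(fast_pitch_shift([0.0], sr=22050, n_steps=1))
# An explicit argument still overrides the new default.
print(fast_pitch_shift([0.0], sr=22050, n_steps=1, res_type="soxr_hq"))
```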