machine learning

Multimodal trials: my tiny CLIP implementation

CLIP is already a three-year-old paper, but its simple design and strong performance still attract me. After one week of programming and debugging, I finished v0.1 of my tiny CLIP. It uses ConvNeXt V2 Nano and some parts of nanoGPT, so that both encoders stay at about 35 million parameters.

The training dataset is CC3M, downloaded with img2dataset. I actually got only 2.3 million images (possibly because of my awful network environment). For testing, I use the 50,000 validation images of ImageNet-1K.

I split CC3M into 90% for training and 10% for validation. After just one night of training (electricity is much cheaper at night), the result seemed too good to be true:

[Eval] loss: 0.2333 accuracy: 0.9257
[003 : 131000] loss: 0.5992 accu: 0.8281 lr: 1.0000e-06 time: 642.28
[004 : 132000] loss: 0.5567 accu: 0.7969 lr: 1.0000e-06 time: 198.91
[004 : 133000] loss: 0.4493 accu: 0.8750 lr: 1.0000e-06 time: 198.52
[004 : 134000] loss: 0.4729 accu: 0.8281 lr: 1.0000e-06 time: 199.15
[004 : 135000] loss: 0.5102 accu: 0.8281 lr: 1.0000e-06 time: 198.22

The accuracy on the 10% validation split is as high as 0.9257, which I guess is caused by the small dataset. Evaluating on the 50,000 validation images of ImageNet-1K shows a top-5 accuracy of just 6.66%. This is far below even the 2016 paper’s 11.5% zero-shot accuracy.

Therefore, I will use CC12M in the next step.
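
For reference, top-5 zero-shot accuracy is typically measured like this: each image is scored against one text embedding per class, and a prediction counts as correct when the true class is among the five highest scores. The sketch below is a minimal pure-Python illustration; the function name top_k_accuracy and the toy scores are hypothetical and are not taken from the actual evaluation code.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest scores.

    scores: one list of per-class similarity scores per image
    labels: the true class index for each image
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the largest scores
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# Toy example: 3 images, 4 classes
scores = [
    [0.1, 0.7, 0.2, 0.0],  # true class 1 is ranked first
    [0.5, 0.1, 0.3, 0.1],  # true class 2 is ranked second
    [0.4, 0.3, 0.2, 0.1],  # true class 3 is ranked last
]
labels = [1, 2, 3]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```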

There are also some questions I need to solve:

The original CLIP paper applies L2 normalization to the multimodal embeddings, but in my tests the model does not converge with it.
Adding the learnable temperature parameter causes a training error that asks for a “retain_graph=True” argument to “backward()”.
When using “torch.compile()”, a Triton error is reported after the first epoch.
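
For context, the loss involved in the first two questions is a symmetric cross-entropy over the image-text similarity matrix: both embeddings are L2-normalized and the similarities are multiplied by a learnable scale, stored as a log value and exponentiated, as described in the CLIP paper. Below is a minimal pure-Python sketch of the forward computation only. All function names are hypothetical, it is not the training code of this project, and it does not explain the convergence problem.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def clip_loss(image_embs, text_embs, log_logit_scale):
    """Symmetric contrastive loss; the i-th image matches the i-th text."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    scale = math.exp(log_logit_scale)
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Transpose to get the text-to-image direction
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_texts = sum(cross_entropy(logits_t[j], j) for j in range(n)) / n
    return (loss_images + loss_texts) / 2


# Hypothetical toy batch of two image/text pairs
log_scale = math.log(1 / 0.07)  # a temperature of 0.07, the CLIP paper's initial value
images = [[1.0, 0.0], [0.0, 2.0]]
matched = clip_loss(images, [[3.0, 0.0], [0.0, 1.0]], log_scale)
swapped = clip_loss(images, [[0.0, 1.0], [3.0, 0.0]], log_scale)
print(f"matched pairs: {matched:.6f}, swapped pairs: {swapped:.4f}")
```

One general cause of the “retain_graph=True” error worth ruling out is computing the exponential of the temperature parameter once, outside the training step, and reusing that tensor across iterations; recomputing it inside every forward pass avoids backpropagating through a graph that has already been freed. Whether that applies to this code is unconfirmed.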

Wish me good luck.

Performance of Flash Attention and torch.compile()

I am trying to build a small repo about multi-modal models (CLIP, ALBEF, BLIP, etc.). The GPT code is mainly from nanoGPT. Then I became curious about the performance of “Flash Attention” and “torch.compile()”.

The metrics with my original code (w/o Flash Attention, w/o torch.compile()):

[100] loss: 4.0315 time 23.7708
[200] loss: 4.0020 time 23.9010
[300] loss: 3.8115 time 23.9407
[400] loss: 3.7021 time 23.9785
[500] loss: 3.6626 time 24.0076
[600] loss: 3.7109 time 24.0060

The metrics after adding Flash Attention:

[100] loss: 4.1204 time 23.0655
[200] loss: 3.8950 time 23.2243
[300] loss: 3.9116 time 23.2714
[400] loss: 3.7837 time 23.2864
[500] loss: 3.8313 time 23.2993
[600] loss: 3.9138 time 23.3255

The metrics after adding Flash Attention and torch.compile():

[100] loss: 3.9969 time 14.8842
[200] loss: 3.8506 time 15.0004
[300] loss: 3.8702 time 15.0050
[400] loss: 3.7977 time 15.0061
[500] loss: 3.7374 time 15.0492
[600] loss: 3.6589 time 15.0661

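It seems “torch.compile()” brings a much larger speedup than “Flash Attention”: the logged time drops from about 24 to about 15 with both enabled, versus about 23.3 with Flash Attention alone.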
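
To put numbers on the comparison, the relative savings can be computed directly from the three logs above. This is plain arithmetic on the logged values; the helper name reduction is only for illustration.

```python
from statistics import mean

# Logged times copied from the three runs above
baseline = [23.7708, 23.9010, 23.9407, 23.9785, 24.0076, 24.0060]
flash = [23.0655, 23.2243, 23.2714, 23.2864, 23.2993, 23.3255]
flash_compile = [14.8842, 15.0004, 15.0050, 15.0061, 15.0492, 15.0661]


def reduction(before, after):
    """Relative reduction of the mean logged time, as a fraction."""
    return 1.0 - mean(after) / mean(before)


flash_saving = reduction(baseline, flash)
compile_saving = reduction(baseline, flash_compile)
print(f"Flash Attention only: {flash_saving:.1%} less time")
print(f"Flash Attention + torch.compile(): {compile_saving:.1%} less time")
```

On these logs, Flash Attention alone saves roughly 3% and the combination roughly 37%.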

Training CIFAR-100 by DeepSpeed

To let DeepSpeed tolerate the failure of one training node, we can use:

deepspeed \
  --master_addr=rogpt1 \
  --elastic_training \
  --min_elastic_nodes=1 \
  --max_elastic_nodes=2 \
  --hostfile=hostfile \
  train.py \
  --deepspeed_config ds_config.json

But if one training node fails and we later want to relaunch it, the relaunch will fail because the node doesn’t have the checkpoint in its local directory. There are two solutions:

Use a shared file system (Filestore on GCP, EFS on AWS, or just NFS) for the cluster and let only the master node save the checkpoint. The saved checkpoint is then visible to all other nodes through the shared file system.
Or, just set “use_node_local_storage” to true. Then every node saves the checkpoints.

{
   "steps_per_print": 2000,
   "checkpoint": {
     "use_node_local_storage": true
   },
   "elasticity": {
     "enabled": true,
     "micro_batch_sizes": [64,128,256],
     "max_train_batch_size": 1024
   },
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.001,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },
   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 0.001,
       "warmup_num_steps": 1000
     }
   },
   "wall_clock_breakdown": false
}

Distributed Data-Parallel training of PyTorch

Let’s get to the point directly:

import os
import time

import torch
import torch.nn as nn
import torch.distributed as dist

from model import resnet152
from dataset import get_data_loaders
from torch.nn.parallel import DistributedDataParallel as DDP

learning_rate = 0.001
num_epochs = 40
momentum = 0.9
weight_decay = 1e-5


def setup():
    # initialize the process group
    dist.init_process_group("nccl")


def cleanup():
    dist.destroy_process_group()


def train(rank, world_size):
    setup()

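    # Bind this process to its own GPU before building the model
    torch.cuda.set_device(rank)
    model = resnet152().to(rank)
    model = DDP(model, device_ids=[rank])

    # Every process loads the checkpoint (not only rank 0); otherwise the
    # ranks would disagree on the weights and on the starting epoch
    if os.path.exists("last.pth"):
        obj = torch.load("last.pth", map_location=f"cuda:{rank}")
        print(f"Rank{rank} load 'last.pth' with epoch: {obj['epoch']}")
        model.load_state_dict(obj["model"])
        begin = obj["epoch"] + 1
    else:
        begin = 0
    print(f"Rank{rank} begin at {begin}")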

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    start = time.time()
    running_loss = 0
    trainloader, testloader = get_data_loaders(rank, world_size)

    for epoch in range(begin, num_epochs):
        trainloader.sampler.set_epoch(epoch)
        for index, (images, labels) in enumerate(trainloader):
            # gpu
            images, labels = images.to(rank), labels.to(rank)

            outputs = model(images)

            loss = criterion(outputs, labels)

            # backward and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        # train
        correct = 0
        total = 0
        with torch.no_grad():
            for data in trainloader:
                images, labels = data

                # gpu
                images, labels = images.to(rank), labels.to(rank)

                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        trainset_accu = 100 * correct / total

        # test
        correct = 0
        total = 0
        with torch.no_grad():
            for data in testloader:
                images, labels = data
                # gpu
                images, labels = images.to(rank), labels.to(rank)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        testset_accu = 100 * correct / total
        if rank == 0:
            print(
                f"[{epoch}] Accu: {trainset_accu:.2f}%, {testset_accu:.2f}% "
                f"| {(time.time() - start)/60.0:.1f} mins, loss: {running_loss}"
            )
            torch.save(model.state_dict(), f"cifar100_{epoch}.pth")
            torch.save({"model": model.state_dict(), "epoch": epoch}, "last.pth")
        running_loss = 0.0

    end = time.time()
    stopWatch = end - start
    print("Training is done")
    print("Total Training Time (second):", stopWatch)
    cleanup()


if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    train(local_rank, world_size)

The main training code comes from this notebook (many thanks to @batuhan3526), and the code for the distributed environment is from here. I haven’t pasted the code for the dataset since this doc already gives a sufficient introduction.
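
Since the dataset code is not shown, here is a rough pure-Python sketch of what the set_epoch() call in the training loop is for. PyTorch's DistributedSampler reshuffles the indices with a seed derived from the epoch and hands each process a disjoint slice. The function below only imitates that idea; it is not PyTorch's implementation, and shard_indices is a hypothetical name.

```python
import random


def shard_indices(dataset_size, rank, world_size, epoch, seed=0):
    """Return the sample indices one rank would see in a given epoch."""
    indices = list(range(dataset_size))
    # Same seed on every rank, so all ranks agree on the shuffled order
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating the first indices so every rank gets the same count
    remainder = len(indices) % world_size
    if remainder:
        indices += indices[: world_size - remainder]
    # Each rank takes every world_size-th index, starting at its own rank
    return indices[rank::world_size]


# 10 samples split over 4 processes (2 nodes x 2 GPUs)
shards = [shard_indices(10, rank, 4, epoch=0) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Without the per-epoch seed, every epoch would present the samples to each rank in the same order.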

To run this snippet on two nodes (each node has two GPUs), I need to use the powerful “torchrun”:

torchrun \
  --rdzv-backend=c10d \
  --rdzv-endpoint=rogpt1:23456 \
  --nnodes=1:2 \
  --max-restarts=3 \
  --nproc-per-node=2 \
  train.py

With the above snippet, the local rank-0 process on each node saves the checkpoint. If one process fails, the whole cluster restarts and resumes training from epoch + 1.

I tried letting only the rank-0 process on node 0 save the checkpoint. However, since the other nodes then have no checkpoint to load, the restart got stuck in an endless loop.

Test of SegmentAnything Model (SAM)

Here is the original picture:

Months ago, I tested the segmentation of YOLOv8. The result was not very promising:

The tail of one monkey couldn’t be segmented correctly.

Today I tested the same picture with Meta’s Segment Anything Model (SAM). Using the simple Colab notebook and the “vit_l” model type, the result looks better:

At least the yellow monkey’s tail has been segmented correctly in full.

Note: To run the notebook on a T4 GPU, you may need to set points_per_batch like this:

SamAutomaticMaskGenerator(sam, points_per_batch=16)

How about running with another picture? The goats:

This time, even SAM couldn’t segment all of the goats’ horns correctly.

Instance Segmentation by YOLOv8

If we want to use YOLOv8 for object detection, here is a good example.

What if I want to use YOLOv8 to segment a picture, crop out the object, and paste it into a new picture (pasting only the object, not the pixels near it)? I wrote an example:

import os
import sys

import cv2
import numpy as np

from ultralytics import YOLO

IMG_SIZE = 2048

def main(path):
    filename, file_extension = os.path.splitext(path)

    # Load segment model of yolov8
    model = YOLO("yolov8x-seg.pt")
    img = cv2.imread(path)

    results = model(img, imgsz=(img.shape[1], img.shape[0]))

    count = 0
    for res in results:
        # res.masks is None when nothing was segmented
        if res.masks is None:
            continue
        for mask in res.masks.xy:
            polygon = mask.reshape((-1, 1, 2)).astype(np.int32)
            x, y, w, h = cv2.boundingRect(polygon)

            # Create a mask with every value set to 255
            binary_mask = np.ones(img.shape, dtype=np.uint8) * 255
            # Fill the polygon (the object we want) with zero
            cv2.fillPoly(binary_mask, [polygon], (0, 0, 0))
            # cv2.add saturates at 255: adding zero keeps the object pixels,
            # while adding 255 pushes the background to white
            out_img = cv2.add(img, binary_mask)[y:y+h, x:x+w]

            cv2.imwrite(f"{filename}_{count}{file_extension}", out_img)
            count += 1
    print(f"Total: {count}")

if __name__ == "__main__":
    main(sys.argv[1])

Using Javascript to load ONNX model for Object Detection job

Although I have used the YOLOv5 model several times, I had never used its corresponding ONNX model. This time, I met a use case that requires running the ONNX model from Javascript.

To learn and debug the code, I installed node.js and started my Javascript trip. This snippet helped me a lot to understand the Non-Max Suppression (NMS) algorithm again (I used NMS many years ago). After studying it and the implementation in YOLOv5, I finally knocked together the Javascript code that uses the YOLOv5 ONNX model to detect objects:

https://github.com/RobinDong/onnx_js/blob/main/detect.js
Why is “YOLOV5S_CLASSES” 85? Because the YOLO series uses the first four floats as the box coordinates (center x, center y, width, height), the fifth as the “object confidence”, and the remaining 80 floats as the “class confidence”.
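
To make the 85-float layout and the NMS step concrete, here is a small pure-Python sketch (written in Python rather than Javascript, to match the other snippets on this page). It is not the code in detect.js: the function names, the synthetic rows, and the two thresholds are chosen for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def decode_row(row):
    """Split one 85-float row into (corner box, score, class id)."""
    cx, cy, w, h, objectness = row[:5]
    class_scores = row[5:]
    class_id = max(range(len(class_scores)), key=lambda c: class_scores[c])
    # Final score = object confidence * best class confidence
    score = objectness * class_scores[class_id]
    box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    return box, score, class_id


def nms(detections, score_threshold=0.25, iou_threshold=0.45):
    """Greedy NMS: keep the best box, drop same-class boxes overlapping it."""
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for det in candidates:
        if all(det[2] != k[2] or iou(det[0], k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


def make_row(cx, cy, w, h, objectness, class_id, class_conf):
    """Build a synthetic 85-float row for the toy example."""
    scores = [0.0] * 80
    scores[class_id] = class_conf
    return [cx, cy, w, h, objectness] + scores


rows = [
    make_row(100, 100, 50, 50, 0.9, 14, 0.9),  # strong detection
    make_row(102, 101, 50, 50, 0.8, 14, 0.9),  # near-duplicate of the first
    make_row(300, 300, 40, 40, 0.7, 14, 0.8),  # separate object
]
kept = nms([decode_row(r) for r in rows])
for box, score, class_id in kept:
    print(class_id, round(score, 2), box)
```

The near-duplicate box overlaps the strongest one heavily, so only two of the three detections survive.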

If you are interested, you can run it with the steps below:

Install node.js on your computer or laptop
Run npm install fs inkjet onnxruntime-node to install js packages
Run node detect.js lorikeet.jpg to detect objects in the image “lorikeet.jpg” and output to file “output.txt”
Run python3 draw.py lorikeet.jpg to show the image of the result by using data in “output.txt”

Hope it could be useful for someone.

Accelerate augmentation of bird audio

audiomentations is a very convenient library for my bird sound classification, as in the code below:

from audiomentations import Compose, AddGaussianNoise, AddGaussianSNR, TimeStretch, PitchShift

        self.augment = Compose([
            AddGaussianNoise(min_amplitude=0.005, max_amplitude=0.015, p=poss),
            AddGaussianSNR(min_snr_in_db=5.0, max_snr_in_db=40.0, p=poss),
            TimeStretch(min_rate=0.8, max_rate=1.2, p=poss),
            PitchShift(min_semitones=-2, max_semitones=2, p=poss)
        ])

These four augmentation methods are enough for my current training. But the PitchShift method costs a lot of CPU time, so the CPU usage jumps to 100% and the GPU cannot run at full load.

Having failed to find an audio augmentation library that runs on the GPU, I started to check the source code of “audiomentations” and noticed that it uses librosa for its implementation:

        try:
            pitch_shifted_samples = librosa.effects.pitch_shift(
                samples, sr=sample_rate, n_steps=self.parameters["num_semitones"]
            )
        except librosa.util.exceptions.ParameterError:

Then the code of “librosa” for “pitch_shift”:

def pitch_shift(
    y: np.ndarray,
    *,
    sr: float,
    n_steps: float,
    bins_per_octave: int = 12,
    res_type: str = "soxr_hq",
    scale: bool = False,
    **kwargs: Any,
) -> np.ndarray:

The default “res_type” for “pitch_shift” is “soxr_hq”. This is a slow resampler. After changing “res_type” to “linear” in “audiomentations”, the CPU usage dropped back to 50% on my desktop and the GPU ramped up to 100% during training.

—— 2023.07.28 ——

Thanks for the correction from Iver.

I ran this test snippet:

import time
import librosa

sound, sr = librosa.load("./song/background/AirportAnnouncements_1.wav")

for resource in [None, "linear", "soxr_hq", "kaiser_best"]:
    begin = time.time()
    for _ in range(10):
        if resource:
            librosa.effects.pitch_shift(sound, sr=sr, n_steps=1, res_type=resource)
        else:
            librosa.effects.pitch_shift(sound, sr=sr, n_steps=1)
    if resource:
        print(f"{resource} time:", time.time() - begin)
    else:
        print("default time:", time.time() - begin)

and got this result:

default time: 8.455572366714478
linear time: 3.3037502765655518
soxr_hq time: 3.3474862575531006
kaiser_best time: 8.467342615127563

Iver is right: soxr_hq is as fast as linear. The actual default res_type of the librosa version I was using is kaiser_best.

Hanging of PyTorch’s data loader

Long story short: I am trying to build a Siamese network for audio classification. With 50% probability, “dataset.py” tries to find a pair of audio clips in the same category but from different files (and, with the other 50%, a pair from different categories). But when evaluation starts, it hangs after fetching a few batches. The stack trace looks like this:

Traceback (most recent call last):                                                                                                                                                                                                        
  File "/home/robin/song/birdclef/old_train.py", line 395, in <module>                                                
    train(args, train_loader, eval_loader)                                                                                                                                                                                                  
  File "/home/robin/song/birdclef/old_train.py", line 280, in train                                                   
    accuracy = evaluate(args, net, eval_loader)                                                                                                                                                                                             
  File "/home/robin/song/birdclef/old_train.py", line 91, in evaluate                                                 
    sounds1, sounds2, type_ids = next(batch_iterator)                                                                 
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()                                                                                                                                                                                                                
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()                                                                                      
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data                                                                                                              
    success, data = self._try_get_data()                                                                                                                                                                                                    
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)                                                                      
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/queue.py", line 180, in get                                   
    self.not_empty.wait(remaining)                                                                                    
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/threading.py", line 324, in wait                              
    gotit = waiter.acquire(True, timeout)                                                                                                                                                                                                   
KeyboardInterrupt

As usual, I started by suspecting PyTorch. Was the PyTorch version (2.0) so new that it contained some flaws? I quickly rejected that thought: if it were PyTorch’s problem, why didn’t the same thing happen when I wasn’t using the Siamese network?

Then I found this issue on PyTorch’s GitHub page. It pointed to the clue: the new code in “dataset.py”. Now I noticed the problem in my code:

            arr = self.cat_map[ebird_code]
            pair_wav_name = np.random.choice(arr)
            while pair_wav_name == wav_name:
                pair_wav_name = np.random.choice(arr)
            pair_sound = self.get_sound(pair_wav_name, ebird_code)

If a category has only one file, this loop never terminates. That is the reason for the hang.

The solution is simple:

            arr = self.cat_map[ebird_code]
            if len(arr) > 1:
                pair_wav_name = np.random.choice(arr)
                while pair_wav_name == wav_name:
                    pair_wav_name = np.random.choice(arr)
            else:
                pair_wav_name = wav_name
            pair_sound = self.get_sound(pair_wav_name, ebird_code)
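
An alternative that avoids the retry loop altogether is to filter the current file out first and then choose once. This is only a sketch with hypothetical names (pick_pair, and plain file-name lists standing in for self.cat_map); it also covers the corner case where a category lists the same file more than once.

```python
import random


def pick_pair(files, current, rng=random):
    """Pick a file from the same category that differs from `current`.

    Falls back to `current` itself when the category has no other file.
    """
    candidates = [name for name in files if name != current]
    if not candidates:
        return current
    return rng.choice(candidates)


single = pick_pair(["a.wav"], "a.wav")
other = pick_pair(["a.wav", "b.wav", "c.wav"], "a.wav", random.Random(0))
print(single, other)
```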

A powerful tool to monitor details of Intel CPU

While researching PCIe 3.0 versus PCIe 4.0, I became curious about the actual application scenario: what is the real bandwidth between the CPU and the GPU when we are training a deep learning model?

Finally, I got this tool: pcm

After building it, I ran “sudo ./bin/pcm” and got this:

I am grateful that I can even see the IPC (Instructions Per Cycle) and the L2/L3 hit ratios with this tool. But the metric I am most interested in is the PCIe bandwidth. Where is it?

I tried “sudo bin/pcm-pcie”, but it said my desktop CPU (i5-12400) is not supported:

The processor is not susceptible to Rogue Data Cache Load: yes
The processor supports enhanced IBRS                     : yes
Package thermal spec power: 65 Watt; Package minimum power: 0 Watt; Package maximum power: 0 Watt;

INFO: Linux perf interface to program uncore PMUs is present

For non-CSV mode delay < 1.0s does not make a lot of practical sense. Default delay 1s is used. Consider to use CSV mode for lower delay values
Update every 1 seconds

Detected 12th Gen Intel(R) Core(TM) i5-12400 "Intel(r) microarchitecture codename Alder Lake" stepping 5 microcode level 0x2c
Jaketown, Ivytown, Haswell, Broadwell-DE, Skylake, Icelake, Snowridge and Sapphirerapids Server CPU is required for this tool! Program aborted
Cleaning up
 Closed perf event handles
 Zeroed uncore PMU registers

Then an idea came to mind: all my CPU does in this application is read data from files and push it to the GPU, so the memory read bandwidth should be approximately the PCIe write bandwidth!

To verify my idea, I changed my model from “tf_efficientnetv2_s_in21k” to “tf_mobilenetv3_small_075” (a smaller model lets the CPU pump more data into the GPU).

As we can see, the memory READ bandwidth increased from “1.36GB” to “13.69GB”. This should be roughly equal to the PCIe bandwidth (since the data from memory only goes to the GPU).
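
To see how close that is to the limits of the bus, the theoretical one-direction bandwidth can be worked out from the per-lane transfer rate. The sketch below assumes the GPU sits in an x16 slot and that the reported figure is per second; both are assumptions, not something the tool output confirms.

```python
def pcie_bandwidth_gbs(gigatransfers_per_s, lanes):
    """Theoretical one-direction PCIe bandwidth in GB/s.

    PCIe 3.0 and 4.0 both use 128b/130b encoding, so 128 of every
    130 transferred bits carry payload.
    """
    payload_bits_per_s = gigatransfers_per_s * 1e9 * lanes * 128 / 130
    return payload_bits_per_s / 8 / 1e9


measured = 13.69  # memory read bandwidth reported above, taken as GB/s
gen3_x16 = pcie_bandwidth_gbs(8.0, 16)   # PCIe 3.0: 8 GT/s per lane
gen4_x16 = pcie_bandwidth_gbs(16.0, 16)  # PCIe 4.0: 16 GT/s per lane
print(f"PCIe 3.0 x16: {gen3_x16:.2f} GB/s, measured is {measured / gen3_x16:.0%} of it")
print(f"PCIe 4.0 x16: {gen4_x16:.2f} GB/s, measured is {measured / gen4_x16:.0%} of it")
```

Under these assumptions, the measured figure is already close to the PCIe 3.0 x16 ceiling and less than half of what PCIe 4.0 x16 offers.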

It seems we really need PCIe 4.0 for deep learning 🙂