PyTorch

Experiments about ‘torchao’

‘torchao‘ is a python library that support PyTorch native quantization and sparsity for training and inference. I just finished some experiments/tests with it for my image-classification project, which use CNN model by PyTorch. Below are some conclusions.

My project already used Automatic Mixed Precision of ‘bfloat16’, but the convert_to_float8_training still easily reduce about 60% of the VRAM (on my RTX 4090 GPU):

from torchao.float8 import convert_to_float8_training

def module_filter_fn(mod: torch.nn.Module, fqn: str) -> bool:
    # Example: Exclude the output layer from float8 conversion
    if fqn == "output":
        return False
    # Example: Exclude linear layers with dimensions not divisible by 16
    if isinstance(mod, torch.nn.Linear):
        if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
            return False
    return True

convert_to_float8_training(m, module_filter_fn=module_filter_fn)



AdamW8bit could decrease the VRAM from 22.6GB to 22.4GB, not too much.

Didn't see any VRAM difference after using CPUOffloadOptimizer. Since it couldn't work well with learning-rate-scheduler. I tend to give up it.

Notes and experiences from Audio Classification research

All the code is here.

The baseline of training balanced data of AudioSet is 0.27 mAP. Using TimeMasking and FrequentMasking could slightly push it to 0.28 mAP.

I tried mixup of raw sounds like AST but it didn’t improve the mAP totally (the reason is still a myth for me). But, the mixup of fbank filters could push metric to 0.293 mAP.

Until then, the fbank filter will be resized to (384, 384) for model deit_distilled. After I recovered the size of fbank filter to (128, 998), it reached 0.323 mAP.

The most recent (hope it’s not the last) change is copied wholly from AST: use the pretrained parameters of Conv2D from deit_distilled but change the stride size — also expand the position embeddings since the sequence length has changed. The result is 0.333 mAP.

It is worth noting that this is the first time I feel the power of pretrained model by my hand. If I re-initialized parameters of position embedings instead of “bilinear” interpolating it, the result will be far away from 0.333 mAP. Also if I used new initialized parameters of the Conv2D (first layer for Vision Transformer), the result is as bad as before.

I will take care of whehter pretrained model also works well for unbalanced data of AduioSet

Multimodal trials: solve the Masked Language problem about my tiny ALBEF implementation (episode 3)

I just wrote my implementation of ALBEF in my own way. But when evaluated with some masked sentences, it failed.

I am using this image:

When I asked “This is a chocolate <|mask|>”, it generated “This is a chocolate urn”. Quite strange

Then I asked “This is a <|mask|> cake, it generated “This is a iph cake”. Totally wrong.

After checking my implementation of the dataset, and training on a small part of CC3M, a week passed and I finally got the reason today: the tiktoken is a BPE tokenizer that will use sub-words as tokens and these sub-words severely hurt the model. For example, sub-words “urn” and “iph” appear too many times and the model would use them to replace the masked word in prediction.

By replacing tiktoken with BertTokenizerFast (from “transformers” package), the model correctly generates “This is a chocolate cake”.

Multimodal trials: my tiny CLIP implementation (episode 2)

Three weeks passed since the previous article. Here are the answers to the previous three questions:

Q1: The original paper of CLIP uses L2 normalization on multimodal embeddings. But in my test, the model will not converge with it.

Answer 1: The reason the model didn’t converge is that the learning rate is too large. After reducing the learning rate a bit and adding the L2 normalization, the model could get above 80% validation accuracy. The L2 normalization essentially projects embeddings to a high-dimensional sphere with a one-unit radius, which by intuitive could regularize the model.

Q2: Adding the learnable temperature parameter will cause a training error, which requires a “retain_graph=True” argument for “backward()”

Answer 2: If I use the code

def __init__(self):
  self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07)).exp()

def forward(self):
  ...
  logits_per_image = self.logit_scale * img_embds @ txt_embds.T

it will report the error

Traceback (most recent call last):
  File "/home/robin/code/try_multimodal/train.py", line 196, in <module>
    trainer.train(args)
  File "/home/robin/code/try_multimodal/train.py", line 149, in train
    train_result = self.train_loop(cmodel, optimizer)
  File "/home/robin/code/try_multimodal/train.py", line 81, in train_loop
    self.scaler.scale(loss).backward()
  File "/home/robin/miniconda3/envs/nanoGPT/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/robin/miniconda3/envs/nanoGPT/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

But if I moved the “exp()” to “forward()”:

def __init__(self):
  self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

def forward(self):
  ...
  logits_per_image = self.logit_scale.exp() * img_embds @ txt_embds.T

It works well. The reason is that “exp()” would bring the gradient to “logit_scale” so we’d better to let it in “forward()” to avoid graph edge duplication.

Q3: When using “torch.compile()”, it will report a Triton error after the first epoch

Answer 3: Seems the “torch.compile()” needs the input batch to be fixed shape, which means you’d better not change the batch_size at the training step. To avoid this, I dropped the last batch of the dataset since the last batch usually wouldn’t have enough samples for BATCH_SIZE.

There is a new discovery for “torch.compile()”. Yesterday I was trying to compile the model (a self-implement ALBEF) but came an error:

...
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')

After tumbling in the debug document of PyTorch compiling, I finally found out that the solution is just installed “g++” on my computer…

Previously, the evaluation of 50000 val images of ImageNet1K shows that the top-5 accuracy is just 6.66%. After I added CC12M with CC3M as training dataset, the top-5 evaluation accuracy raised to 23.76%. My tiny CLIP model checkpoint is here.

Does sinusoid Positional Embeddings actually work well?

The GPT part of my Multimodal trials mainly comes from nanoGPT. In the nanoGPT, the Positional Encoding is just a learnable tensor (“wpe” means “weights of positional embedding”):

self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))

It’s different from the implementation of the original paper. The original paper mentioned:

We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results.

The “vanilla” Positional Embeddings for the transformer are two functions:

$PE_(pos,2i) = sin(pos/10000^{2i/d_{model}})$

$PE_(pos,2i+1) = cos(pos/10000^{2i/d_{model}})$

Which ones work better in the model training? Let me try running “python train.py config/train_shakespeare_char.py” in nanoGPT and get the best validation loss as metrics.

I wrote my own sinusoid Positional Embeddings for testing:

class GPT(nn.Module):
  def __init__(self, config):
	...
    # Position Embedding from original Transformer paper
    divisor = torch.pow(
        10000, 2 * torch.arange(1, config.n_embd + 1) / config.n_embd
    )
    pe = []
    for pos in range(1, config.block_size + 1):
        if pos % 2 == 0:
            pe.append(torch.sin(pos / divisor).unsqueeze(0))
        else:
            pe.append(torch.cos(pos / divisor).unsqueeze(0))
    self.register_buffer("pos_emb", torch.cat(pe, 0))

The “10000” (let’s call it “base number” for convenience) looks too big for a shorter sequence length, so I do experiments by changing it to “block_size” “2*block_size” etc.

The testing result:

	validation loss
Original nanoGPT	1.4754
Base number: 10000	1.4959
Base number: 4 * block_size	1.4916
Base number: 2 * block_size	1.4995
Base number: 3.14/2 * block_size	1.4870
Base number: block_size	1.4947

From my simple tests, the learnable Positional Embeddings has the best effort. nanoGPT wins this round.

I have a guess about why the author of Transformer chose “10000”. The smallest “pos” is 1 and the biggest $2i/d_{model}$ is 2. Therefore the smallest value in sin() is $1/10000^2=1e-8$ , which is very close to the minimal value of FLOAT16 $5.96e-8$

Multimodal trials: my tiny CLIP implementation

CLIP is already a three years old paper but its simple design and significant performance still attracted me. After one week of programming and debugging, I finished v0.1-version of my tiny CLIP. It uses ConvNextV2 Nano and some parts of nanoGPT so both encoders will keep parameters of about 35millons.

The training dataset is CC3M downloaded by using the tool from img2dataset. The actual number of images is 2.3 million (might be my awful network environment). For the testing dataset, I use the 50000 val images of ImageNet1K.

I split the CC3M into 90% training and 10% validating. Just after one night of training (the electricity fee is much cheaper at night), the result seems too good to be true:

[Eval] loss: 0.2333 accuracy: 0.9257
[003 : 131000] loss: 0.5992 accu: 0.8281 lr: 1.0000e-06 time: 642.28
[004 : 132000] loss: 0.5567 accu: 0.7969 lr: 1.0000e-06 time: 198.91
[004 : 133000] loss: 0.4493 accu: 0.8750 lr: 1.0000e-06 time: 198.52
[004 : 134000] loss: 0.4729 accu: 0.8281 lr: 1.0000e-06 time: 199.15
[004 : 135000] loss: 0.5102 accu: 0.8281 lr: 1.0000e-06 time: 198.22

The accuracy in the 10% validating data is as high as 0.9257, which I guess is caused by this small dataset. The evaluation of 50000 val images of ImageNet1K shows that the top-5 accuracy is just 6.66%. This is far away even from the 2016 paper‘s 11.5% zero-shot accuracy.

Therefore, I will use CC12M in the next step.

There are also some questions I need to solve:

The original paper of CLIP uses L2 normalization on multimodal embeddings. But in my test, the model will not converge with it.
Adding the learnable temperature parameter will cause a training error, which requires a “retain_graph=True” argument for “backward()”
When using “torch.compile()”, it will report a Triton error after the first epoch

Wish me good luck.

Performance of Flash Attention and torch.compile()

I am trying to build a small repo about multi-modal models (CLIP, ALBEF, BLIP etc). The GPT code is mainly from nanoGPT. Then I became inquisitive about the performance of “Flash Attention” and “torch.compile()”.

The metrics with my original code (w/o Flash Attention, w/o torch.compile()):

[100] loss: 4.0315 time 23.7708
[200] loss: 4.0020 time 23.9010
[300] loss: 3.8115 time 23.9407
[400] loss: 3.7021 time 23.9785
[500] loss: 3.6626 time 24.0076
[600] loss: 3.7109 time 24.0060

The metrics after adding Flash Attention:

[100] loss: 4.1204 time 23.0655
[200] loss: 3.8950 time 23.2243
[300] loss: 3.9116 time 23.2714
[400] loss: 3.7837 time 23.2864
[500] loss: 3.8313 time 23.2993
[600] loss: 3.9138 time 23.3255

The metrics after adding Flash Attention and torch.compile()

[100] loss: 3.9969 time 14.8842                                                                                               
[200] loss: 3.8506 time 15.0004                                                                                               
[300] loss: 3.8702 time 15.0050                               
[400] loss: 3.7977 time 15.0061                                                                                               
[500] loss: 3.7374 time 15.0492       
[600] loss: 3.6589 time 15.0661

Seems “torch.compile()” is much more powerful than “Flash Attention”

Training CIFAR-100 by DeepSpeed

To let DeepSpeed support the failure of one training node, we could use:

deepspeed \
  --master_addr=rogpt1 \
  --elastic_training \
  --min_elastic_nodes=1 \
  --max_elastic_nodes=2 \
  --hostfile=hostfile \
  train.py \
  --deepspeed_config ds_config.json

But if one training node fails and later we want to relaunch it, it will fail to relaunch because it doesn’t have the checkpoint in the local directory. To solve this, there are two solutions:

Using a shared file system (Filestore of GCP, EFS of AWS, or just NFS) for the cluster and only letting the master node save the checkpoint. The saved checkpoint will be seen by all other nodes through the shared file system.
Or, just set “use_node_local_storage” to true. Then all the nodes will save the checkpoints.

{
   "steps_per_print": 2000,
   "checkpoint": {
     "use_node_local_storage": true
   },
   "elasticity": {
     "enabled": true,
     "micro_batch_sizes": [64,128,256],
     "max_train_batch_size": 1024
   },
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.001,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },
   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 0.001,
       "warmup_num_steps": 1000
     }
   },
   "wall_clock_breakdown": false
}

Distributed Data-Parallel training of PyTorch

Let’s get to the point directly:

import os
import time

import torch
import torch.nn as nn
import torch.distributed as dist

from model import resnet152
from dataset import get_data_loaders
from torch.nn.parallel import DistributedDataParallel as DDP

learning_rate = 0.001
num_epochs = 40
momentum = 0.9
weight_decay = 1e-5


def setup():
    # initialize the process group
    dist.init_process_group("nccl")


def cleanup():
    dist.destroy_process_group()


def train(rank, world_size):
    setup()

    model = resnet152().to(rank)
    model = DDP(model)

    if rank == 0 and os.path.exists("last.pth"):
        obj = torch.load("last.pth")
        print(f"Rank{rank} load 'last.pth' with epoch: {obj['epoch']}")
        model.load_state_dict(obj["model"])
        begin = obj["epoch"] + 1
    else:
        begin = 0
    print(f"Rank{rank} begin at {begin}")

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    start = time.time()
    running_loss = 0
    trainloader, testloader = get_data_loaders(rank, world_size)

    for epoch in range(begin, num_epochs):
        trainloader.sampler.set_epoch(epoch)
        for index, (images, labels) in enumerate(trainloader):
            # gpu
            images, labels = images.to(rank), labels.to(rank)

            outputs = model(images)

            loss = criterion(outputs, labels)

            # backward and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        # train
        correct = 0
        total = 0
        with torch.no_grad():
            for data in trainloader:
                images, labels = data

                # gpu
                images, labels = images.to(rank), labels.to(rank)

                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        trainset_accu = 100 * correct / total

        # test
        correct = 0
        total = 0
        with torch.no_grad():
            for data in testloader:
                images, labels = data
                # gpu
                images, labels = images.to(rank), labels.to(rank)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        testset_accu = 100 * correct / total
        if rank == 0:
            print(
                f"[{epoch}] Accu: {trainset_accu:.2f}%, {testset_accu:.2f}% \
                    | {(time.time() - start)/60.0:.1f} mins, loss: {running_loss}"
            )
            torch.save(model.state_dict(), f"cifar100_{epoch}.pth")
            torch.save({"model": model.state_dict(), "epoch": epoch}, "last.pth")
        running_loss = 0.0

    end = time.time()
    stopWatch = end - start
    print("Training is done")
    print("Total Training Time (second):", stopWatch)
    cleanup()


if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    train(local_rank, world_size)

The main training code comes from this notebook (really appreciate to @batuhan3526), and the code for the distributed environment is from here. I haven’t pasted the code for the dataset since this doc already gives sufficient introduction.

To run this snippet on two nodes (every node has two GPUs), I need to use the powerful “torchrun“:

torchrun \
  --rdzv-backend=c10d \
  --rdzv-endpoint=rogpt1:23456 \
  --nnodes=1:2 \
  --max-restarts=3 \
  --nproc-per-node=2 \
  train.py

For the above snippet, the Rank-0 process will save the checkpoint for each node. If one process fails, the whole cluster will restart and resume training from epoch + 1.

I tried letting only the Rank-0 process on node-0 save the checkpoint once. However since other nodes won’t have the checkpoint to load, the restart failed with a dead loop.

Intel extension for PyTorch

Trying to test the Intel extension for PyTorch in my project, but it reported errors:

Traceback (most recent call last):                                                                                                                                                                                                          
  File "reviewjpgs_optimaztion_testing.py", line 27, in <module>                                                                                                                                                                            
    import intel_extension_for_pytorch as ipex                                                                                                                                                                                              
  File "/home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/__init__.py", line 11, in <module>                                                                                                        
    from .cpu import _cpu_isa                                                                                                                                                                                                               
  File "/home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/cpu/__init__.py", line 1, in <module>                                                                                                     
    from . import runtime                                                                                                                                                                                                                   
  File "/home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/cpu/runtime/__init__.py", line 3, in <module>                                                                                             
    from .multi_stream import MultiStreamModule, get_default_num_streams, \
  File "/home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/cpu/runtime/multi_stream.py", line 4, in <module>
    import intel_extension_for_pytorch._C as core
ImportError: /home/hero/.pyenv/versions/3.8.12/lib/python3.8/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-cpu.so: undefined symbol: _ZNK3c1010TensorImpl22is_strides_like_customENS_12MemoryFormatE

The answer is quite tricky: need to install the IPEX package with the same version of PyTorch.

After the testing of both torch.jit.trace and this IPEX, we found out that `torch.jit.trace` could boost the performance of prediction significantly but IPEX could not.