Robin on Linux – Page 2 – All about technology

Notes and experiences from Audio Classification research

All the code is here.

The baseline from training on the balanced subset of AudioSet is 0.27 mAP. Adding TimeMasking and FrequencyMasking pushes it slightly, to 0.28 mAP.
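As a rough illustration of what time and frequency masking do to a filterbank, here is a minimal NumPy sketch. The function name `mask_spectrogram` and the maximum mask widths (48 mel bins, 192 frames) are assumptions chosen for the example, not values taken from my training code; only the (128, 998) fbank shape comes from this post.

```python
import numpy as np

def mask_spectrogram(fbank, max_freq_width, max_time_width, rng):
    """Zero out one random band of mel bins and one random span of frames.

    fbank has shape (n_mels, n_frames); a masked copy is returned.
    """
    masked = fbank.copy()
    n_mels, n_frames = masked.shape

    # Frequency mask: a horizontal band of random height.
    f = int(rng.integers(0, max_freq_width + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    masked[f0:f0 + f, :] = 0.0

    # Time mask: a vertical band of random width.
    t = int(rng.integers(0, max_time_width + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    masked[:, t0:t0 + t] = 0.0
    return masked

rng = np.random.default_rng(0)
fbank = np.ones((128, 998))  # stand-in for a real log-mel filterbank
masked = mask_spectrogram(fbank, max_freq_width=48, max_time_width=192, rng=rng)
```

The mask is sampled anew for every training example, so the model never sees exactly the same spectrogram twice.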

I tried mixup on raw audio, as AST does, but it didn’t improve the mAP at all (the reason is still a mystery to me). Mixup on the fbank features, however, pushed the metric to 0.293 mAP.
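A minimal sketch of mixup applied to fbank features rather than to raw waveforms. The function name `mixup`, the Beta parameter `alpha`, and the three-class label vectors are assumptions for illustration; the post does not record the exact settings I used.

```python
import numpy as np

def mixup(fbank_a, labels_a, fbank_b, labels_b, alpha, rng):
    """Blend two examples and their multi-hot labels with the same weight."""
    lam = float(rng.beta(alpha, alpha))  # mixing weight in [0, 1]
    mixed_fbank = lam * fbank_a + (1.0 - lam) * fbank_b
    mixed_labels = lam * labels_a + (1.0 - lam) * labels_b
    return mixed_fbank, mixed_labels, lam

rng = np.random.default_rng(0)
fbank_a, fbank_b = np.zeros((128, 998)), np.ones((128, 998))
labels_a = np.array([1.0, 0.0, 0.0])
labels_b = np.array([0.0, 1.0, 0.0])
mixed_fbank, mixed_labels, lam = mixup(fbank_a, labels_a, fbank_b, labels_b, alpha=10.0, rng=rng)
```

Because the labels are mixed with the same weight as the inputs, the loss has to accept soft targets (binary cross-entropy does).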

Up to that point, the fbank features had been resized to (384, 384) to fit the deit_distilled model. After I restored the fbank size to (128, 998), the model reached 0.323 mAP.

The most recent (I hope not the last) change is copied wholesale from AST: reuse the pretrained Conv2D parameters from deit_distilled but change the stride, and expand the position embeddings since the sequence length has changed. The result is 0.333 mAP.

It is worth noting that this is the first time I have felt the power of a pretrained model first-hand. If I re-initialize the position embeddings instead of interpolating them bilinearly, the result falls far short of 0.333 mAP. Likewise, if I use freshly initialized parameters for the Conv2D (the first layer of the Vision Transformer), the result is as bad as before.
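One way to expand pretrained position embeddings to a new patch grid is sketched below. The helper names (`resize_pos_embed`, `conv_out`) are hypothetical, and the concrete numbers are assumptions: a 24×24 grid for the 384×384 pretrained model, two prefix tokens (class and distillation), 16×16 patches and a stride of 10. The post only states that the stride was changed and the embeddings were interpolated bilinearly.

```python
import torch
import torch.nn.functional as F

def conv_out(size, kernel=16, stride=10):
    """Number of patches along one axis for a strided patch-embedding conv."""
    return (size - kernel) // stride + 1

def resize_pos_embed(pos_embed, old_grid, new_grid, num_prefix_tokens=2):
    """Bilinearly interpolate patch position embeddings to a new grid.

    pos_embed: (1, num_prefix_tokens + old_h * old_w, dim)
    """
    prefix = pos_embed[:, :num_prefix_tokens]   # class/distillation tokens, kept as-is
    patches = pos_embed[:, num_prefix_tokens:]
    dim = patches.shape[-1]
    # (1, N, dim) -> (1, dim, old_h, old_w) so F.interpolate sees a 2D "image".
    patches = patches.reshape(1, old_grid[0], old_grid[1], dim).permute(0, 3, 1, 2)
    patches = F.interpolate(patches, size=new_grid, mode="bilinear", align_corners=False)
    patches = patches.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], dim)
    return torch.cat([prefix, patches], dim=1)

old_pos_embed = torch.randn(1, 2 + 24 * 24, 768)   # stand-in for the pretrained tensor
new_grid = (conv_out(128), conv_out(998))          # grid for a (128, 998) fbank
new_pos_embed = resize_pos_embed(old_pos_embed, (24, 24), new_grid)
```

The interpolated tensor keeps the smooth spatial structure learned during pretraining, which a random re-initialization throws away.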

Next I will check whether the pretrained model also works well on the unbalanced AudioSet data.

Augmentation helps ALBEF a lot

I was trying to implement ALBEF by myself for practice. After finishing all the parts (vision part, BERT part, including the Masked Language Model), I trained the model on the COCO-Captions/SBU-Captions/CC3M/CC12M datasets (actually more data than the original paper). But the results were quite weird: an old steam train was recognised as a building, and a few fish were recognised as statues.

To solve these weird mistakes, I reviewed the code many times and finally noticed a sentence in the paper:

Although it’s just an ordinary sentence in the paper, augmentation improves the ALBEF model significantly. After randomly cropping the 256×256 raw image to 224×224 and applying RandAugment, I finally got a more stable and suitable model. Let’s see some examples:

Previously, the fish had been recognised as “shoes”, and the bedroom as “city”. Both are recognised correctly after augmentation.

But there are still some interesting bad cases:

Adding the prefix “A picture of” helps the ALBEF model recognise images better; perhaps that is simply because there is a lot of text like “A picture of XXX” in the CC3M and CC12M datasets.

Anyhow, I finally implemented and trained a workable ALBEF model by myself, on my RTX-3080Ti card.

Strange problem in Python Client of Vertex AI

To create a pipeline schedule in Vertex AI, we can use the snippet below:

from google.cloud import aiplatform

pipeline_job = aiplatform.PipelineJob(
  template_path="COMPILED_PIPELINE_PATH",
  pipeline_root="PIPELINE_ROOT_PATH",
  display_name="DISPLAY_NAME",
)

pipeline_job_schedule = pipeline_job.create_schedule(
  display_name="SCHEDULE_NAME",
  cron="TZ=CRON",
  max_concurrent_run_count=MAX_CONCURRENT_RUN_COUNT,
  max_run_count=MAX_RUN_COUNT,
  service_account="XYZ",
)

This Python code runs as service account “XYZ”, and we also want the schedule to run as service account “XYZ”. Makes sense, right? But the execution throws an error:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "You do not have permission to act as service_account: vertex-runner@pers-decision-engine-dev.iam.gserviceaccount.com. (or it may not exist)."
	debug_error_string = "UNKNOWN:Error received from peer ipv4:74.125.201.95:443 {created_time:"2024-06-06T01:51:02.837225888+00:00", grpc_status:3, grpc_message:"You do not have permission to act as service_account: vertex-runner@pers-decision-engine-dev.iam.gserviceaccount.com. (or it may not exist)."}"

Why does the Python client of Vertex AI need to “act as” service account “XYZ” when it is already running as service account “XYZ”? I can’t answer. Fortunately, the solution is to grant the “Service Account User” role to the service account “XYZ” (as this shows).

It seems Google Cloud still has some work to do before Vertex AI runs smoothly.

We are supposed to learn like the Large Language Model

When I entered junior middle school in a small town in southwest China in 1993, I met my first English book. Yes, it looked exactly like this:

Then the terrible six years of Chinese-style English learning started. For normal kids in poor regions of China, the only way to learn a new language is to MEMORIZE IT. For the vocabulary, I memorized all of it by writing it on draft paper again and again. For the phonetic symbols, I memorized all of them by writing them on draft paper again and again. For the grammar, I memorized all of it by... Wait a minute. Why did I need to memorize the grammar? Because the English examination would test it. That was the only purpose of learning English: not to read foreign stories or get to know the world, but to get a higher examination score.

During the six years of middle school, I spent more than 60 per cent of my study time on English (laboriously memorizing phonetic symbols and grammar) and still got poor results: I couldn’t read a long English story, couldn’t recognize a lot of common English words, and couldn’t even write a decent article. All I learned was some basic English words and some useless grammar.

Only when I went to university and started to read a 300-page English reading-comprehension book (yes, still for an examination, but at least with no need to recite those stupid grammar rules or phonetic symbols) did I finally notice that I could improve my English just by reading books.

Time zips by. I got my Kindle (yes, that electronic paper device) in August 2011 (when I was 31 years old) and finished reading my long-desired first long English story, “Jurassic Park”. Since then, I have read a lot of English books: “The Lost World”, “The Wild Wheel”, “The Swiss Family Robinson”, and, best of all, “A Song of Ice and Fire”. I feel happy when reading English books, and my English improves as well. Happy learning: that’s the result of reading books.

So my conclusion is: if you want to learn English well, don’t try to memorize boring grammar rules; just read books, a lot of books.

Doesn’t this sound familiar? Yes, it sounds just like “The Bitter Lesson”, or the scaling laws of machine learning. A Large Language Model doesn’t need to learn grammar or go to school. It only needs to read a lot of books and articles (training on a large corpus).

The LLM learns like a human, and I think I can also learn from it: reading a lot is already enough for learning.

Multimodal trials: solve the Masked Language problem about my tiny ALBEF implementation (episode 3)

I just wrote my implementation of ALBEF in my own way. But when evaluated with some masked sentences, it failed.

I am using this image:

When I asked “This is a chocolate <|mask|>”, it generated “This is a chocolate urn”. Quite strange.

Then I asked “This is a <|mask|> cake”, and it generated “This is a iph cake”. Totally wrong.

After checking my implementation of the dataset and training on a small part of CC3M, a week passed before I finally found the reason today: tiktoken is a BPE tokenizer that uses sub-words as tokens, and these sub-words severely hurt the model. For example, the sub-words “urn” and “iph” appear so often that the model uses them to fill in the masked word.

By replacing tiktoken with BertTokenizerFast (from “transformers” package), the model correctly generates “This is a chocolate cake”.

Multimodal trials: my tiny CLIP implementation (episode 2)

Three weeks have passed since the previous article. Here are the answers to the previous three questions:

Q1: The original paper of CLIP uses L2 normalization on multimodal embeddings. But in my test, the model will not converge with it.

Answer 1: The reason the model didn’t converge was that the learning rate was too large. After reducing the learning rate a bit and adding the L2 normalization, the model got above 80% validation accuracy. L2 normalization essentially projects the embeddings onto a unit-radius sphere in a high-dimensional space, which intuitively could regularize the model.
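A small NumPy sketch of what the L2 normalization does to the logits. The helper `l2_normalize` and the random embeddings are assumptions for illustration; `logit_scale` uses the same 1 / 0.07 starting value as the code in Answer 2.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

rng = np.random.default_rng(0)
img_embds = l2_normalize(rng.normal(size=(4, 8)))  # 4 images, 8-dim embeddings
txt_embds = l2_normalize(rng.normal(size=(4, 8)))  # 4 matching captions

logit_scale = 1 / 0.07
# Every entry is a cosine similarity in [-1, 1] multiplied by logit_scale.
logits_per_image = logit_scale * img_embds @ txt_embds.T
```

After normalization the logits are bounded by the temperature alone, so the embeddings cannot grow in magnitude just to lower the loss.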

Q2: Adding the learnable temperature parameter will cause a training error, which requires a “retain_graph=True” argument for “backward()”

Answer 2: If I use the code

def __init__(self):
  self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07)).exp()

def forward(self):
  ...
  logits_per_image = self.logit_scale * img_embds @ txt_embds.T

it will report the error

Traceback (most recent call last):
  File "/home/robin/code/try_multimodal/train.py", line 196, in <module>
    trainer.train(args)
  File "/home/robin/code/try_multimodal/train.py", line 149, in train
    train_result = self.train_loop(cmodel, optimizer)
  File "/home/robin/code/try_multimodal/train.py", line 81, in train_loop
    self.scaler.scale(loss).backward()
  File "/home/robin/miniconda3/envs/nanoGPT/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/robin/miniconda3/envs/nanoGPT/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

But if I moved the “exp()” to “forward()”:

def __init__(self):
  self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

def forward(self):
  ...
  logits_per_image = self.logit_scale.exp() * img_embds @ txt_embds.T

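It works well. The reason is that in the first version “exp()” runs only once, in “__init__()”, so “self.logit_scale” is the output of an autograd node rather than the parameter itself (and it is not even registered as a module parameter). Every training step then reuses that same “exp()” node, whose saved values are freed by the first “backward()”, so the second “backward()” fails. Calling “exp()” inside “forward()” rebuilds that part of the graph on every step.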
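The difference can be reproduced in isolation with two toy modules. The class names and the `run_two_steps` helper are hypothetical and stand in for the real CLIP model; only the two ways of writing `logit_scale` are taken from the code above.

```python
import math
import torch
import torch.nn as nn

class ScaleInInit(nn.Module):
    def __init__(self):
        super().__init__()
        # exp() runs once here: the attribute is a non-leaf tensor, not a Parameter.
        self.logit_scale = nn.Parameter(torch.ones([]) * math.log(1 / 0.07)).exp()

    def forward(self, x):
        return (self.logit_scale * x).sum()

class ScaleInForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.logit_scale = nn.Parameter(torch.ones([]) * math.log(1 / 0.07))

    def forward(self, x):
        # exp() is re-recorded in the graph on every call.
        return (self.logit_scale.exp() * x).sum()

def run_two_steps(model):
    """Return True if two forward/backward passes succeed, False on RuntimeError."""
    x = torch.ones(3)
    try:
        for _ in range(2):
            model(x).backward()
    except RuntimeError:
        return False
    return True

init_version_ok = run_two_steps(ScaleInInit())        # second backward() fails
forward_version_ok = run_two_steps(ScaleInForward())  # both steps succeed
registered_in_init_version = len(list(ScaleInInit().parameters()))
```

The last line also shows a second problem with the first version: the optimizer would never see the temperature, because the module has no registered parameter.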

Q3: When using “torch.compile()”, it will report a Triton error after the first epoch

Answer 3: It seems “torch.compile()” wants the input batch to have a fixed shape, which means you’d better not change the batch size during training. To avoid this, I drop the last batch of the dataset, since it usually doesn’t have enough samples to fill BATCH_SIZE.

There is a new discovery about “torch.compile()”. Yesterday I was trying to compile the model (a self-implemented ALBEF) but got an error:
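Dropping the incomplete last batch is a single argument on the PyTorch DataLoader. A minimal sketch with a toy dataset of 10 samples (the dataset and the batch size of 4 are assumptions for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE = 4
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))  # 10 toy samples

# Default behaviour: the final batch holds the 2 leftover samples.
sizes_default = [batch[0].shape[0] for batch in DataLoader(dataset, batch_size=BATCH_SIZE)]

# drop_last=True: every batch has exactly BATCH_SIZE samples.
sizes_dropped = [
    batch[0].shape[0]
    for batch in DataLoader(dataset, batch_size=BATCH_SIZE, drop_last=True)
]
```

With shuffling enabled, a different handful of samples is dropped each epoch, so no data is lost permanently.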

...
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')

After digging through the PyTorch compilation debugging documentation, I finally found that the solution is simply to install “g++” on my computer…

Previously, the evaluation on the 50000 val images of ImageNet1K showed a top-5 accuracy of just 6.66%. After I added CC12M to CC3M as training data, the top-5 evaluation accuracy rose to 23.76%. My tiny CLIP model checkpoint is here.

“No space left on device” problem I met on ext4

After downloading the whole CC12M dataset from Huggingface, I wrote a tool to extract all of the image-text-pair files into one directory. But after extracting 17 million (17681010 exactly) files, the tool reported the error:

Exception: [Errno 28] No space left on device: '/home/robin/Downloads/cc12m/011647171.txt'

I checked the space and inodes on my ext4 filesystem, and both seem to have free capacity:

# "df -lh"
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.2G  2.0M  3.2G   1% /run
/dev/nvme0n1p1  916G  410G  461G  48% /
tmpfs            16G  412K   16G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
efivarfs        192K  124K   64K  66% /sys/firmware/efi/efivars
/dev/sda2        96M   32M   65M  33% /boot/efi
tmpfs           3.2G   56K  3.2G   1% /run/user/1000
/dev/sdb2       3.7T  2.5T  1.3T  67% /mnt

# "df -i"
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
tmpfs           4077303    1264  4076039    1% /run
/dev/nvme0n1p1 61054976 9308680 51746296   16% /
tmpfs           4077303     104  4077199    1% /dev/shm
tmpfs           4077303       4  4077299    1% /run/lock
efivarfs              0       0        0     - /sys/firmware/efi/efivars
/dev/sda2             0       0        0     - /boot/efi
tmpfs            815460      61   815399    1% /run/user/1000
/dev/sdb2             0       0        0     - /mnt

Then why did the ext4 filesystem return a “No space” error? The reason is explained here: https://blog.merovius.de/posts/2013-10-20-ext4-mysterious-no-space-left-on/.

After using “sudo dumpe2fs /dev/nvme0n1p1”, I got:

...
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      e10f88a7-1d8c-4c38-a796-6fa15bdf4e65
...

It seems the hash algorithm of “dir_index” on my ext4 filesystem is already “half_md4”, therefore my only choice is to switch to “tea”. (The default “hash_alg” when you use “mke2fs” is “half_md4”.)

But after I made the change:

sudo tune2fs -E "hash_alg=tea" /dev/nvme0n1p1

the error “No space left on device” still showed up…

There are two solutions left:

Rewrite my tool to generate a few big flat files, each containing many of the previous “small files”
Replace ext4 with xfs (I will test this after I get another NVMe SSD)
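The first option could look roughly like the sketch below, which packs many small files into numbered tar shards using only the standard library. The function name `write_shards`, the shard naming scheme, and the choice of tar as the container are my assumptions here, not something I have tested on the full dataset.

```python
import io
import tarfile
import tempfile
from pathlib import Path

def write_shards(samples, out_dir, shard_size=10000):
    """Pack (name, payload_bytes) pairs into tar files of at most shard_size members."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    for index, (name, payload) in enumerate(samples):
        if index % shard_size == 0:
            # Start a new shard every shard_size samples.
            if tar is not None:
                tar.close()
            path = out_dir / f"shard-{index // shard_size:06d}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return shard_paths

with tempfile.TemporaryDirectory() as tmp:
    samples = [(f"{i:09d}.txt", f"caption {i}".encode()) for i in range(25)]
    shards = write_shards(samples, tmp, shard_size=10)
    shard_count = len(shards)
    with tarfile.open(shards[-1]) as last_shard:
        last_names = last_shard.getnames()
```

With shards of 10000 members, 17 million files turn into fewer than two thousand directory entries, which keeps the directory index far away from its limits.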

Do sinusoidal Positional Embeddings actually work well?

The GPT part of my Multimodal trials mainly comes from nanoGPT. In the nanoGPT, the Positional Encoding is just a learnable tensor (“wpe” means “weights of positional embedding”):

self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))

It’s different from the implementation of the original paper. The original paper mentioned:

We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results.

The “vanilla” Positional Embeddings for the transformer are two functions:

$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$

$PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$

Which one works better in model training? Let me run “python train.py config/train_shakespeare_char.py” in nanoGPT and use the best validation loss as the metric.

I wrote my own sinusoid Positional Embeddings for testing:

class GPT(nn.Module):
  def __init__(self, config):
    ...
    # Position Embedding from original Transformer paper
    divisor = torch.pow(
        10000, 2 * torch.arange(1, config.n_embd + 1) / config.n_embd
    )
    pe = []
    for pos in range(1, config.block_size + 1):
        if pos % 2 == 0:
            pe.append(torch.sin(pos / divisor).unsqueeze(0))
        else:
            pe.append(torch.cos(pos / divisor).unsqueeze(0))
    self.register_buffer("pos_emb", torch.cat(pe, 0))

The “10000” (let’s call it the “base number” for convenience) looks too big for a shorter sequence length, so I also experimented with changing it to “block_size”, “2*block_size”, etc.
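For comparison, here is a sketch that follows the two formulas literally: sine and cosine alternate along the embedding dimension (even and odd indices), not along positions as in my test code above, and the exponent $2i/d_{model}$ stays below 1. The function name and the `base` argument are assumptions for illustration, and I have not re-run the experiments with this variant.

```python
import numpy as np

def sinusoidal_position_embedding(block_size, n_embd, base=10000.0):
    """Build a (block_size, n_embd) table; n_embd is assumed to be even."""
    positions = np.arange(block_size)[:, None]            # pos = 0 .. block_size - 1
    two_i = np.arange(0, n_embd, 2)[None, :]              # 2i = 0, 2, 4, ...
    angles = positions / np.power(base, two_i / n_embd)   # pos / base^(2i / d_model)

    pe = np.zeros((block_size, n_embd))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_position_embedding(block_size=256, n_embd=384)
```

To use it in the model, the table can be wrapped with torch.from_numpy() and passed to register_buffer() in place of the learnable “wpe”; a different base number is just a different `base` argument.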

The testing result:

Positional embedding	Best validation loss
Original nanoGPT (learnable)	1.4754
Base number: 10000	1.4959
Base number: 4 * block_size	1.4916
Base number: 2 * block_size	1.4995
Base number: 3.14/2 * block_size	1.4870
Base number: block_size	1.4947

From my simple tests, the learnable Positional Embeddings give the best result. nanoGPT wins this round.

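I have a guess about why the authors of the Transformer chose “10000”. In my implementation above, the smallest “pos” is 1 and the biggest exponent is 2, so the smallest value passed to sin() is $1/10000^2 = 1e-8$, which is close to the smallest positive FLOAT16 value, $5.96e-8$. (In the paper’s formula the exponent $2i/d_{model}$ stays below 1, so the smallest value there is only about $1e-4$, which makes this guess less convincing.)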

Multimodal trials: my tiny CLIP implementation

CLIP is already a three-year-old paper, but its simple design and significant performance still attract me. After one week of programming and debugging, I finished v0.1 of my tiny CLIP. It uses ConvNextV2 Nano and some parts of nanoGPT, so both encoders keep about 35 million parameters.

The training dataset is CC3M, downloaded with the img2dataset tool. The actual number of images is 2.3 million (probably because of my awful network environment). For the testing dataset, I use the 50000 val images of ImageNet1K.

I split CC3M into 90% training and 10% validation. After just one night of training (the electricity fee is much cheaper at night), the result seems too good to be true:

[Eval] loss: 0.2333 accuracy: 0.9257
[003 : 131000] loss: 0.5992 accu: 0.8281 lr: 1.0000e-06 time: 642.28
[004 : 132000] loss: 0.5567 accu: 0.7969 lr: 1.0000e-06 time: 198.91
[004 : 133000] loss: 0.4493 accu: 0.8750 lr: 1.0000e-06 time: 198.52
[004 : 134000] loss: 0.4729 accu: 0.8281 lr: 1.0000e-06 time: 199.15
[004 : 135000] loss: 0.5102 accu: 0.8281 lr: 1.0000e-06 time: 198.22

The accuracy on the 10% validation data is as high as 0.9257, which I guess is caused by the small dataset. The evaluation on the 50000 val images of ImageNet1K shows that the top-5 accuracy is just 6.66%. This is far below even the 2016 paper’s 11.5% zero-shot accuracy.
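For reference, zero-shot top-k accuracy can be computed from embeddings as in the NumPy sketch below. The function name `top_k_accuracy` and the tiny hand-made embeddings are assumptions for illustration; in the real evaluation the class embeddings would come from the text encoder applied to the ImageNet class names.

```python
import numpy as np

def top_k_accuracy(image_embeds, class_embeds, labels, k=5):
    """Fraction of images whose true class is among the k most similar class embeddings."""
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    class_embeds = class_embeds / np.linalg.norm(class_embeds, axis=1, keepdims=True)
    similarity = image_embeds @ class_embeds.T            # (n_images, n_classes)
    top_k = np.argsort(-similarity, axis=1)[:, :k]        # best k classes per image
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

class_embeds = np.eye(4)                      # 4 toy classes
image_embeds = np.array([
    [1.0, 0.5, 0.0, 0.0],                     # closest to class 0
    [0.0, 1.0, 0.5, 0.0],                     # closest to class 1
    [0.0, 0.0, 1.0, 0.5],                     # closest to class 2
    [0.5, 0.2, 0.1, 1.0],                     # closest to class 3, then class 0
])
labels = np.array([0, 1, 2, 0])

top1 = top_k_accuracy(image_embeds, class_embeds, labels, k=1)
top2 = top_k_accuracy(image_embeds, class_embeds, labels, k=2)
```

In this toy case the last image is wrong at top-1 but counted as correct at top-2, which is exactly the kind of near miss that top-5 accuracy forgives.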

Therefore, I will use CC12M in the next step.

There are also some questions I need to solve:

The original paper of CLIP uses L2 normalization on multimodal embeddings. But in my test, the model will not converge with it.
Adding the learnable temperature parameter will cause a training error, which requires a “retain_graph=True” argument for “backward()”
When using “torch.compile()”, it will report a Triton error after the first epoch

Wish me good luck.

Performance of Flash Attention and torch.compile()

I am trying to build a small repo about multi-modal models (CLIP, ALBEF, BLIP etc). The GPT code is mainly from nanoGPT. Then I became inquisitive about the performance of “Flash Attention” and “torch.compile()”.

The metrics with my original code (w/o Flash Attention, w/o torch.compile()):

[100] loss: 4.0315 time 23.7708
[200] loss: 4.0020 time 23.9010
[300] loss: 3.8115 time 23.9407
[400] loss: 3.7021 time 23.9785
[500] loss: 3.6626 time 24.0076
[600] loss: 3.7109 time 24.0060

The metrics after adding Flash Attention:

[100] loss: 4.1204 time 23.0655
[200] loss: 3.8950 time 23.2243
[300] loss: 3.9116 time 23.2714
[400] loss: 3.7837 time 23.2864
[500] loss: 3.8313 time 23.2993
[600] loss: 3.9138 time 23.3255

The metrics after adding Flash Attention and torch.compile():

[100] loss: 3.9969 time 14.8842
[200] loss: 3.8506 time 15.0004
[300] loss: 3.8702 time 15.0050
[400] loss: 3.7977 time 15.0061
[500] loss: 3.7374 time 15.0492
[600] loss: 3.6589 time 15.0661

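It seems “torch.compile()” is much more powerful than “Flash Attention”: Flash Attention alone cuts the logged time by only about 3% (from about 23.9 to about 23.2), while adding “torch.compile()” cuts it by about 37% relative to the original code (down to about 15.0).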
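This post doesn’t show how the two features were switched on. In PyTorch 2.x, one common way (and the way nanoGPT does it) is to replace the hand-written attention with “F.scaled_dot_product_attention” and to wrap the model in “torch.compile()”. The sketch below only checks that the fused call matches a manual causal attention; the function names and tensor sizes are assumptions, and on CPU the call falls back to a non-Flash kernel.

```python
import math
import torch
import torch.nn.functional as F

def manual_causal_attention(q, k, v):
    """Reference implementation: softmax(QK^T / sqrt(d)) V with a causal mask."""
    seq_len = q.size(-2)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def fused_causal_attention(q, k, v):
    """Same computation through the fused PyTorch 2.x entry point."""
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

torch.manual_seed(0)
# (batch, heads, sequence, head_dim)
q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))
max_diff = (manual_causal_attention(q, k, v) - fused_causal_attention(q, k, v)).abs().max().item()

# Compilation is a one-line opt-in on the finished model:
# model = torch.compile(model)
```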