After downloading the whole CC12M dataset from Hugging Face, I wrote a tool to extract all of the image–text-pair files into one directory. But after extracting about 17 million files (17,681,010, to be exact), the tool reported an error:
Exception: [Errno 28] No space left on device: '/home/robin/Downloads/cc12m/011647171.txt'
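For context, the write loop in such a tool might look like this minimal sketch (the `extract_pairs` helper and the in-memory list of pairs are hypothetical; the real tool reads the downloaded CC12M archives):

```python
import errno
import os

def extract_pairs(pairs, out_dir):
    """Write (name, caption) pairs as individual .txt files in one directory.

    Hypothetical sketch of the extraction step; the real tool streams the
    pairs out of the dataset archives instead of taking an in-memory list.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    for name, caption in pairs:
        path = os.path.join(out_dir, f"{name}.txt")
        try:
            with open(path, "w") as f:
                f.write(caption)
        except OSError as e:
            if e.errno == errno.ENOSPC:
                # "No space left on device" -- as it turns out, this does
                # not necessarily mean we are out of blocks or inodes.
                print(f"ENOSPC after {written} files, at {path}")
            raise
        written += 1
    return written
```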
I checked the space and inodes on my ext4 filesystem, and both seemed to have plenty of free capacity:
# "df -lh"
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.2G  2.0M  3.2G   1% /run
/dev/nvme0n1p1  916G  410G  461G  48% /
tmpfs            16G  412K   16G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
efivarfs        192K  124K   64K  66% /sys/firmware/efi/efivars
/dev/sda2        96M   32M   65M  33% /boot/efi
tmpfs           3.2G   56K  3.2G   1% /run/user/1000
/dev/sdb2       3.7T  2.5T  1.3T  67% /mnt
# "df -i"
Filesystem       Inodes    IUsed    IFree IUse% Mounted on
tmpfs           4077303     1264  4076039    1% /run
/dev/nvme0n1p1 61054976  9308680 51746296   16% /
tmpfs           4077303      104  4077199    1% /dev/shm
tmpfs           4077303        4  4077299    1% /run/lock
efivarfs              0        0        0     - /sys/firmware/efi/efivars
/dev/sda2             0        0        0     - /boot/efi
tmpfs            815460       61   815399    1% /run/user/1000
/dev/sdb2             0        0        0     - /mnt
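The same numbers can also be checked programmatically with `os.statvfs`, which exposes the block and inode counters that `df` reads:

```python
import os

def fs_capacity(path):
    """Report free space and free inodes for the filesystem containing
    `path` -- the same counters shown by `df -lh` and `df -i`."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,   # available to non-root
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
    }
```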
So why did the ext4 filesystem return a “No space” error? The reason is explained here: https://blog.merovius.de/posts/2013-10-20-ext4-mysterious-no-space-left-on/. In short, ext4’s “dir_index” feature stores directory entries in a fixed-depth hash tree; once the leaf blocks for one hash range are full, inserting another entry that hashes into that range fails with ENOSPC, even though plenty of blocks and inodes remain.
Running “sudo dumpe2fs /dev/nvme0n1p1” showed:
...
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      e10f88a7-1d8c-4c38-a796-6fa15bdf4e65
...
It seems the hash algorithm used by the “dir_index” feature of my ext4 filesystem is already “half_md4”, so my only remaining choice was “tea”. (The default hash algorithm when you create a filesystem with “mke2fs” is “half_md4”.)
But after I made the change:
sudo tune2fs -E "hash_alg=tea" /dev/nvme0n1p1
the “No space left on device” error still appeared…
There are two solutions left:
- Rewrite my tool to pack the previous “small files” into a few big flat files
- Replace ext4 with XFS (I will test this after I get another NVMe SSD)
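The first solution can be sketched like this: pack the per-sample caption files into a single uncompressed tar archive, so the directory holds one entry instead of millions. (The `pack_captions`/`read_caption` helpers and file names are illustrative, not the actual tool.)

```python
import io
import tarfile

def pack_captions(pairs, archive_path):
    """Append (name, caption) pairs into one big tar archive instead of
    creating one tiny .txt file per sample."""
    with tarfile.open(archive_path, "w") as tar:
        for name, caption in pairs:
            data = caption.encode("utf-8")
            info = tarfile.TarInfo(name=f"{name}.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

def read_caption(archive_path, name):
    """Read back a single caption by member name."""
    with tarfile.open(archive_path, "r") as tar:
        return tar.extractfile(f"{name}.txt").read().decode("utf-8")
```

An uncompressed tar also keeps the samples streamable, which is how webdataset-style training pipelines usually consume CC12M-scale data anyway.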