Robin on Linux – Page 3 – All about technology

Use a specific service account in the Argo job

I created a simple Argo job to pull messages from a Google Cloud Pub/Sub topic. The required permission had already been granted to the service account bound to GKE’s workload identity, but the Argo job failed with these errors:

argo submit example.json -n argoproj

hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/grpc_helpers.py", line 72, in error_remapped_callable
hello-world-pqbm5:     return callable_(*args, **kwargs)
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 1030, in __call__
hello-world-pqbm5:     return _end_unary_response_blocking(state, call, False, None)
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
hello-world-pqbm5:     raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
hello-world-pqbm5: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
hello-world-pqbm5:      status = StatusCode.PERMISSION_DENIED
hello-world-pqbm5:      details = "User not authorized to perform this action."
hello-world-pqbm5:      debug_error_string = "UNKNOWN:Error received from peer ipv4:74.125.69.95:443 {grpc_message:"User not authorized to perform this action.", grpc_status:7, created_time:"2023-05-15T01:10:43.128528579+00:00"}"
hello-world-pqbm5: >
hello-world-pqbm5: 
hello-world-pqbm5: The above exception was the direct cause of the following exception:
hello-world-pqbm5: 
hello-world-pqbm5: Traceback (most recent call last):
hello-world-pqbm5:   File "<string>", line 26, in <module>
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/pubsub_v1/services/subscriber/client.py", line 1495, in pull
hello-world-pqbm5:     response = rpc(
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/gapic_v1/method.py", line 113, in __call__
hello-world-pqbm5:     return wrapped_func(*args, **kwargs)
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/retry.py", line 349, in retry_wrapped_func
hello-world-pqbm5:     return retry_target(
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/retry.py", line 191, in retry_target
hello-world-pqbm5:     return target()
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/timeout.py", line 120, in func_with_timeout
hello-world-pqbm5:     return func(*args, **kwargs)
hello-world-pqbm5:   File "/usr/local/lib/python3.9/dist-packages/google/api_core/grpc_helpers.py", line 74, in error_remapped_callable
hello-world-pqbm5:     raise exceptions.from_grpc_error(exc) from exc
hello-world-pqbm5: google.api_core.exceptions.PermissionDenied: 403 User not authorized to perform this action.

Thanks to my colleagues, who reminded me that an Argo job needs to specify a service account when it runs in the workload identity namespace:

argo submit example.json -n argoproj --serviceaccount argo-workflow

Or, I can add this service account to the YAML file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow                  # new type of k8s spec
metadata:
  generateName: hello-world-    # name of the workflow spec
spec:
  entrypoint: whalesay          # invoke the whalesay template
  serviceAccountName: argo-workflow

Empty messages received by PubSub pull()

I want my Python script to receive one message from a Pub/Sub topic and then go on to other work. The code is adapted from an example in the GCP documentation:

with subscriber:
    # The subscriber pulls a specific number of messages. The actual
    # number of messages pulled may be smaller than max_messages.
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": NUM_MESSAGES},
        retry=retry.Retry(deadline=300),
    )

    if len(response.received_messages) == 0:
        return

The problem is that the pull sometimes returns an empty response, meaning that “len(response.received_messages)” is zero.

Where do these empty responses come from? Here is the answer:

Once a message is sent to a subscriber, the subscriber must either acknowledge or drop the message. A message is considered outstanding once it has been sent out for delivery and before a subscriber acknowledges it.

My solution is simply to keep pulling until a non-empty response arrives:

with subscriber:
    # The subscriber pulls a specific number of messages. The actual
    # number of messages pulled may be smaller than max_messages.
    while True:
        response = subscriber.pull(
            request={"subscription": subscription_path, "max_messages": NUM_MESSAGES},
            retry=retry.Retry(deadline=300),
        )

        if len(response.received_messages) > 0:
            break
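The loop above retries forever and without any pause. As a rough sketch only, the same idea can be written with a cap on the number of attempts and a short delay between pulls. The names pull_until_nonempty, pull_once, max_attempts and delay_seconds are hypothetical, and the fake responses merely stand in for subscriber.pull(...).received_messages:

```python
import time
from typing import Callable, List, Optional


def pull_until_nonempty(
    pull_once: Callable[[], List[str]],
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
) -> Optional[List[str]]:
    """Call pull_once() until it returns messages or the attempts run out."""
    for attempt in range(max_attempts):
        messages = pull_once()
        if messages:
            return messages
        # Wait a little longer after each empty response (hypothetical policy).
        time.sleep(delay_seconds * (attempt + 1))
    return None


# Stand-in for the real pull: empty twice, then one message.
fake_responses = iter([[], [], ["hello"]])
result = pull_until_nonempty(lambda: next(fake_responses), delay_seconds=0.0)
print(result)
```

Returning None after the last attempt lets the caller decide whether to give up or to go on to other work.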

Hanging of PyTorch’s data loader

Long story short: I am trying to build a Siamese network for audio classification. With 50% probability, “dataset.py” picks a pair of audio clips from the same category but from different files (and, for the other 50%, a pair from different categories). But when evaluation starts, it hangs after fetching a few batches. Interrupting it shows this trace:

Traceback (most recent call last):                                                                                                                                                                                                        
  File "/home/robin/song/birdclef/old_train.py", line 395, in <module>                                                
    train(args, train_loader, eval_loader)                                                                                                                                                                                                  
  File "/home/robin/song/birdclef/old_train.py", line 280, in train                                                   
    accuracy = evaluate(args, net, eval_loader)                                                                                                                                                                                             
  File "/home/robin/song/birdclef/old_train.py", line 91, in evaluate                                                 
    sounds1, sounds2, type_ids = next(batch_iterator)                                                                 
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()                                                                                                                                                                                                                
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()                                                                                      
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data                                                                                                              
    success, data = self._try_get_data()                                                                                                                                                                                                    
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)                                                                      
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/queue.py", line 180, in get                                   
    self.not_empty.wait(remaining)                                                                                    
  File "/home/robin/miniconda3/envs/bird/lib/python3.10/threading.py", line 324, in wait                              
    gotit = waiter.acquire(True, timeout)                                                                                                                                                                                                   
KeyboardInterrupt

As usual, I started by suspecting PyTorch. Is this version of PyTorch so new (2.0) that it contains some flaws? I quickly rejected that thought: if PyTorch were the problem, why didn’t the same hang happen when I was not using the Siamese network?

Then I found this issue on the PyTorch GitHub page. It pointed to the clue: the new code in “dataset.py”. Then I noticed the problem in my code:

            arr = self.cat_map[ebird_code]
            pair_wav_name = np.random.choice(arr)
            while pair_wav_name == wav_name:
                pair_wav_name = np.random.choice(arr)
            pair_sound = self.get_sound(pair_wav_name, ebird_code)

If a category has only one file, this loop never ends. That is the reason for the hang.

The solution is simple:

            arr = self.cat_map[ebird_code]
            if len(arr) > 1:
                pair_wav_name = np.random.choice(arr)
                while pair_wav_name == wav_name:
                    pair_wav_name = np.random.choice(arr)
            else:
                pair_wav_name = wav_name
            pair_sound = self.get_sound(pair_wav_name, ebird_code)
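An alternative that avoids the retry loop altogether is to remove the current file from the candidates before choosing. This is only a sketch: pick_pair is a hypothetical helper, and it uses the standard-library random module instead of np.random.choice so that it runs on its own:

```python
import random
from typing import List


def pick_pair(candidates: List[str], current: str) -> str:
    """Pick a file from the same category, different from `current` if possible."""
    others = [name for name in candidates if name != current]
    if not others:
        # The category has a single file: fall back to pairing it with itself.
        return current
    return random.choice(others)


print(pick_pair(["a.wav"], "a.wav"))
print(pick_pair(["a.wav", "b.wav", "c.wav"], "a.wav"))
```

Because the current file is filtered out first, there is no loop left that could spin forever.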

A powerful tool to monitor details of Intel CPU

While researching PCIe 3.0 versus PCIe 4.0, I wanted to look seriously at the actual application scenario: what is the real bandwidth between CPU and GPU when we are training a deep learning model?

Finally, I got this tool: pcm

After building it, I ran “sudo ./bin/pcm” and got this:

I am glad that this tool even shows the IPC (Instructions Per Cycle) and the L2/L3 hit ratios. But the metric I am most interested in is the PCIe bandwidth. Where is it?

I tried “sudo bin/pcm-pcie”, but it said my desktop CPU (i5-12400) is not supported:

The processor is not susceptible to Rogue Data Cache Load: yes
The processor supports enhanced IBRS                     : yes
Package thermal spec power: 65 Watt; Package minimum power: 0 Watt; Package maximum power: 0 Watt;

INFO: Linux perf interface to program uncore PMUs is present

For non-CSV mode delay < 1.0s does not make a lot of practical sense. Default delay 1s is used. Consider to use CSV mode for lower delay values
Update every 1 seconds

Detected 12th Gen Intel(R) Core(TM) i5-12400 "Intel(r) microarchitecture codename Alder Lake" stepping 5 microcode level 0x2c
Jaketown, Ivytown, Haswell, Broadwell-DE, Skylake, Icelake, Snowridge and Sapphirerapids Server CPU is required for this tool! Program aborted
Cleaning up
 Closed perf event handles
 Zeroed uncore PMU registers

Then a new idea came to mind: all my CPU does in this application is read data from files and push it to the GPU, so the memory read bandwidth is approximately the PCIe write bandwidth!

To verify my idea, I changed my model from “tf_efficientnetv2_s_in21k” to “tf_mobilenetv3_small_075” (a smaller model lets the CPU pump more data into the GPU).

As we can see, the memory READ bandwidth increased from “1.36GB” to “13.69GB”. This should be roughly equal to the PCIe bandwidth (since the data read from memory only goes to the GPU).
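For comparison, the theoretical one-direction bandwidth of a PCIe link can be worked out from the transfer rate and the 128b/130b line encoding. The sketch below assumes a x16 link, which the post does not state; the helper name is hypothetical. If the “13.69GB” reading is per second (also an assumption), it is already close to the PCIe 3.0 ceiling:

```python
def pcie_gb_per_s(gigatransfers_per_s: float, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s with 128b/130b encoding."""
    payload_bits_per_s = gigatransfers_per_s * 1e9 * 128 / 130
    return payload_bits_per_s / 8 * lanes / 1e9


gen3 = pcie_gb_per_s(8.0)    # PCIe 3.0 runs at 8 GT/s per lane
gen4 = pcie_gb_per_s(16.0)   # PCIe 4.0 runs at 16 GT/s per lane
print(f"PCIe 3.0 x16: {gen3:.2f} GB/s, PCIe 4.0 x16: {gen4:.2f} GB/s")
```

These are upper bounds; real transfers lose some bandwidth to protocol overhead.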

It seems we really need PCIe 4.0 for deep learning 🙂

Use bits instead of set for visited nodes. LeetCode #1434

My first idea was depth-first search: iterate over all people and try to give each of them a different hat. That solution got TLE (Time Limit Exceeded). Then, following a hint from the discussion forum, I iterated over hats (instead of people) and tried to give each hat to a different person. That solution also got TLE (even though I used lru_cache for the function):

import functools
from collections import defaultdict
from typing import List

class Solution:
        
    def numberWays(self, hats: List[List[int]]) -> int:
        hp = defaultdict(set)
        for index, hat in enumerate(hats):
            for _id in hat:
                hp[_id].add(index)
                
        hp = [people for people in hp.values()]
        @functools.lru_cache(None)
        def dfs(start, path) -> int:
            if len(path) == len(hats):
                return 1
            if start == len(hp):
                return 0
            total = 0
            for person in (hp[start] - set(path)):
                total += dfs(start + 1, tuple(list(path) + [person]))
            total += dfs(start + 1, path)
            return total % (10**9 + 7)

        return dfs(0, tuple())

Using a tuple to record visited people is not efficient enough in this case. Since there are no more than 10 people, the most efficient data structure for recording visited people is a bitmask.

My final solution still uses DFS (with lru_cache, it is effectively dynamic programming too):

import functools
from collections import defaultdict
from typing import List

class Solution:
        
    def numberWays(self, hats: List[List[int]]) -> int:
        hp = defaultdict(set)
        for index, hat in enumerate(hats):
            for _id in hat:
                hp[_id].add(index)
                
        hp = [people for people in hp.values()]
        @functools.lru_cache(None)
        def dfs(start, mask) -> int:
            if bin(mask).count('1') == len(hats):
                return 1
            if start == len(hp):
                return 0
            total = 0
            for person in hp[start]:
                if (1 << person) & mask > 0:
                    continue
                mask |= 1 << person
                total += dfs(start + 1, mask)
                mask ^= 1 << person
            total += dfs(start + 1, mask)
            return total % (10**9 + 7)

        return dfs(0, 0)
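The bit operations used above can be tried in isolation. This is a small sketch with hypothetical helper names (mark, is_visited, count_visited), where bit i of the mask stands for person i:

```python
def mark(mask: int, person: int) -> int:
    """Return the mask with the bit for `person` set."""
    return mask | (1 << person)


def is_visited(mask: int, person: int) -> bool:
    """True if the bit for `person` is set in the mask."""
    return ((mask >> person) & 1) == 1


def count_visited(mask: int) -> int:
    """Number of people recorded in the mask."""
    return bin(mask).count("1")


mask = 0
mask = mark(mask, 0)
mask = mark(mask, 3)
print(bin(mask), is_visited(mask, 3), is_visited(mask, 1), count_visited(mask))
```

An int is hashable and cheap to compare, which suits lru_cache better than building a new tuple on every call.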

Upgrade Ubuntu to solve a GPU problem

After installing an RTX 2080 Ti in an old 2016 desktop at the beginning of 2019, we used it to train YOLOv6 for a while. But recently the training job would occasionally hang and the GPU would stop working. The only message I could see was from dmesg:

[ 8104.078794] NVRM: GPU at PCI:0000:01:00: GPU-b4f425ef-2d0f-f29e-5624-ff96b37c2c46
[ 8104.078796] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 8104.078797] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 8104.078803] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

At first, I suspected the NVIDIA driver was too new. But after rolling back to an older driver, the same errors showed up in dmesg. The problem even seemed to occur more frequently; sometimes the machine could not run for more than 24 hours.

Considering that Ubuntu 18.04 (and its Linux kernel) is too old, I tried to upgrade it. Although I have installed many Linux systems and kernels on different machines (servers, desktops, laptops, and even development boards), this was the first time I upgraded an existing Ubuntu system.

By following the guide, I managed to upgrade from 18.04 to 20.04. Surprisingly, the new system works well with the older NVIDIA driver, and the GPU has been running smoothly for more than 12 hours now.

In conclusion, we should use a new system (new kernel) together with new hardware drivers. If the training job doesn’t report any more errors, I will keep using 20.04 and save myself the time of upgrading to 22.04.

Efficiently fetch the smallest element of a heap in Python

To solve LeetCode #1675, I wrote the code below with the help of hints:

import sys
import heapq
from typing import List

class Solution:
    def minimumDeviation(self, nums: List[int]) -> int:
        n = len(nums)
        heapq.heapify(nums)
        _max = max(nums)
        ans = max(nums) - min(nums)
        while True:
            item = heapq.heappop(nums)
            if item % 2 == 1:
                heapq.heappush(nums, item * 2)
                _max = max(_max, item * 2)
                ans = min(ans, _max - heapq.nsmallest(1, nums)[0])
            else:
                heapq.heappush(nums, item)
                break
        print("stage1:", nums)
        nums = [-item for item in nums]
        heapq.heapify(nums)
        _max = max(nums)
        while True:
            item = heapq.heappop(nums)
            if item % 2 == 0:
                heapq.heappush(nums, item // 2)
                _max = max(_max, item // 2)
                ans = min(ans, _max - heapq.nsmallest(1, nums)[0])
            else:
                break
                
        return ans

I know, the code looks quite messy. But the more annoying problem is: it exceeded the time limit.

After studying the solutions from these smart guys, I realized how stupid I was: I can just use nums[0] instead of heapq.nsmallest(1, nums)[0] to get the smallest element of a heap, just as the official documentation says.

Then I just changed two lines of my code, and it passed all the test cases in time:

import sys
import heapq
from typing import List

class Solution:
    def minimumDeviation(self, nums: List[int]) -> int:
        n = len(nums)
        heapq.heapify(nums)
        _max = max(nums)
        ans = max(nums) - min(nums)
        while True:
            item = heapq.heappop(nums)
            if item % 2 == 1:
                heapq.heappush(nums, item * 2)
                _max = max(_max, item * 2)
                ans = min(ans, _max - nums[0])
            else:
                heapq.heappush(nums, item)
                break
        print("stage1:", nums)
        nums = [-item for item in nums]
        heapq.heapify(nums)
        _max = max(nums)
        while True:
            item = heapq.heappop(nums)
            if item % 2 == 0:
                heapq.heappush(nums, item // 2)
                _max = max(_max, item // 2)
                ans = min(ans, _max - nums[0])
            else:
                break
                
        return ans
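The peek trick can be checked on a small heap. In this sketch (the variable names are only for illustration), heap[0] is read in constant time, while heapq.nsmallest(1, heap) has to look at every element on each call:

```python
import heapq
import random

values = [random.randint(1, 1000) for _ in range(50)]
heap = list(values)
heapq.heapify(heap)

peeked = heap[0]                            # constant-time peek at the minimum
via_nsmallest = heapq.nsmallest(1, heap)[0]  # scans the whole heap
print(peeked, via_nsmallest, min(values))
```

All three printed values are the same; only the cost of getting them differs.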

A tip for the time complexity of LeetCode #127

The first idea that came to mind after looking at LeetCode #127 was to use a graph algorithm. But to build the graph first, I would need to compare every pair of words in the wordList, which takes O(n²) comparisons.

Here comes the time complexity analysis: the length of the wordList is about 5000, so O(n²) means 5000*5000=25*10⁶. As a rough rule of thumb, a Python script on LeetCode takes about 1 second for every 10⁶ operations. Thus 25*10⁶ would take about 25 seconds, which is too long for a LeetCode question.

Therefore the best way to build the graph is not to traverse the wordList multiple times, but to iterate over the lower-case alphabet at each position of each word (note the constraint: beginWord, endWord, and wordList[i] consist of lowercase English letters). This reduces the work to 260*5000=1.3*10⁶ (the maximum word length in wordList is 10, so each word has at most 26*10=260 candidates).

The code below also uses my old trick of visited nodes.

from collections import defaultdict
from typing import List

class Solution:
    def ladderLength(self, beginWord: str, endWord: str, wordList: List[str]) -> int:
        words_set = set(wordList)
        conns = defaultdict(set)
        for word in wordList + [beginWord]:
            for index in range(len(word)):
                conns[word] |= {word[:index] + cand + word[index+1:] for cand in "abcdefghijklmnopqrstuvwxyz" if cand != word[index] and word[:index] + cand + word[index+1:] in words_set}
        # bfs
        queue = {beginWord}
        already = set()
        ans = 1
        while queue:
            new_queue = set()
            for node in queue:
                for _next in conns[node]:
                    if _next == endWord:
                        return ans + 1
                    new_queue.add(_next)
            already |= queue
            queue = new_queue - already
            ans += 1
        return 0
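The neighbour generation packed into the set comprehension above can also be written as a separate helper, which makes the 26 × word-length cost per word easier to see. This is a sketch with a hypothetical function name, not a replacement for the solution:

```python
from string import ascii_lowercase
from typing import List, Set


def neighbors(word: str, words_set: Set[str]) -> List[str]:
    """Words in words_set that differ from `word` in exactly one position."""
    found = []
    for index in range(len(word)):
        for cand in ascii_lowercase:
            if cand == word[index]:
                continue
            candidate = word[:index] + cand + word[index + 1:]
            if candidate in words_set:  # set lookup, no scan of the word list
                found.append(candidate)
    return found


words = {"hot", "dot", "dog", "lot", "log", "cog"}
print(sorted(neighbors("hit", words)))
print(sorted(neighbors("hot", words)))
```

Each call tries at most 26 letters per position, so the total work grows with the number of words rather than with its square.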

Divide and Conquer solution for LeetCode #494

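The popular solution for LeetCode #494 is dynamic programming. But my first idea was Divide and Conquer: split the array into two halves, count the sums reachable in each half, and combine the counts. Although it’s not very fast, it’s another idea:

from collections import Counter
from typing import List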

class Solution:
    def get_results(self, nums: List[int]) -> Counter:
        layer = [nums[0], -nums[0]]
        for num in nums[1:]:
            new_layer = []
            for item in layer:
                for val in [-num, num]:
                    new_layer.append(item + val)
            layer = new_layer
        return Counter(layer)
    
    def findTargetSumWays(self, nums: List[int], target: int) -> int:
        n = len(nums)
        if n == 1:
            # Count +nums[0] and -nums[0] separately, so [0] with target 0 gives 2.
            return self.get_results(nums)[target]
        half = n // 2
        left = self.get_results(nums[:half])
        right = self.get_results(nums[half:])
        ans = 0
        for lkey, lcnt in left.items():
            rcnt = right[target - lkey]
            ans += rcnt * lcnt
        return ans
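For comparison, here is a minimal sketch of the dynamic-programming approach mentioned at the top of this post. It is not the code I submitted, and the function name is hypothetical. It keeps a Counter of how many ways each running sum can be reached:

```python
from collections import Counter
from typing import List


def find_target_sum_ways(nums: List[int], target: int) -> int:
    """Count sign assignments of nums whose total equals target."""
    ways = Counter({0: 1})  # one way to reach sum 0 before using any number
    for num in nums:
        next_ways = Counter()
        for total, count in ways.items():
            next_ways[total + num] += count
            next_ways[total - num] += count
        ways = next_ways
    return ways[target]


print(find_target_sum_ways([1, 1, 1, 1, 1], 3))
```

The Counter holds one entry per distinct sum instead of one per sign assignment, which is why this approach scales better than enumerating every combination.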

An improvement makes the pass of LeetCode #2359

The first idea that came to mind was to use sets to track the nodes reachable from the two start nodes and pick the first node in the intersection of the two sets. That led to the first solution:

from collections import defaultdict
from typing import List

class Solution:
    def bfs(self, node1: int, node2: int, conns, length) -> int:
        set1 = {node1}
        set2 = {node2}
        step = 0
        while step <= length:
            inter = set1 & set2
            if len(inter) > 0:
                return min(list(inter))
            new_set1 = set()
            new_set2 = set()
            for node in set1:
                new_set1 |= conns[node]
            for node in set2:
                new_set2 |= conns[node]
            if len(new_set1 - set1) <= 0 and len(new_set2 - set2) <= 0:
                return -1
            set1 |= new_set1
            set2 |= new_set2
            step += 1
    
    def closestMeetingNode(self, edges: List[int], node1: int, node2: int) -> int:
        conns = defaultdict(set)
        for index, edge in enumerate(edges):
            if edge >= 0:
                conns[index].add(edge)
        return self.bfs(node1, node2, conns, len(edges))

I am pretty satisfied with the simplicity of the above code. But unfortunately, it exceeded the time limit.

Sometimes we don’t need to start a new solution before optimising the first one. Maybe I don’t need sets, since they are too expensive in Python. I can use an array to track all visited nodes instead, where meeting a VISITED node means “intersection”. To distinguish the visits coming from the two start nodes, I let Node1 mark “1” in the array and Node2 mark “2”. That gives my second solution. It’s a little longer but uses an array instead of sets:

from typing import List

class Solution:
    def closestMeetingNode(self, edges: List[int], node1: int, node2: int) -> int:
        if node1 == node2:
            return node1
        n = len(edges)
        visited = [0] * n
        step = 0
        visited[node1] = 1
        visited[node2] = 2
        while True:
            ans = []
            old_node1 = node1
            nxt = edges[node1]
            if nxt >= 0:
                if visited[nxt] == 0:
                    visited[nxt] = 1
                    node1 = nxt
                elif visited[nxt] == 2:
                    ans.append(nxt)
            old_node2 = node2
            nxt = edges[node2]
            if nxt >= 0:
                if visited[nxt] == 0:
                    visited[nxt] = 2
                    node2 = nxt
                elif visited[nxt] == 1:
                    ans.append(nxt)
            if len(ans) > 0:
                return min(ans)
            if old_node1 == node1 and old_node2 == node2:
                return -1
        return -1

As above, I use “old_node1” and “old_node2” to detect a dead loop (neither pointer moved in the last round). It beats 97% of submissions on run time. Not bad.
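Another way to frame the same problem, shown here only as a sketch and not as the solution I submitted, is to record the distance from each start node to every node it can reach and then pick the node whose larger distance is smallest. The function names are hypothetical:

```python
from typing import List


def distances(edges: List[int], start: int) -> List[int]:
    """Distance from start to each reachable node, -1 if unreachable."""
    dist = [-1] * len(edges)
    node, steps = start, 0
    # Each node has at most one outgoing edge, so just follow the chain.
    while node != -1 and dist[node] == -1:
        dist[node] = steps
        node = edges[node]
        steps += 1
    return dist


def closest_meeting_node(edges: List[int], node1: int, node2: int) -> int:
    d1 = distances(edges, node1)
    d2 = distances(edges, node2)
    best, best_cost = -1, None
    for i in range(len(edges)):
        if d1[i] >= 0 and d2[i] >= 0:
            cost = max(d1[i], d2[i])
            # Strict comparison keeps the smallest index on ties.
            if best_cost is None or cost < best_cost:
                best, best_cost = i, cost
    return best


print(closest_meeting_node([2, 2, 3, -1], 0, 1))
```

It makes three linear passes over arrays, so it also avoids sets.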