Robin on Linux – Page 16 – All about technology

Use both ‘withParam’ and ‘when’ in Argo Workflows (on Kubernetes)

In Argo, we can use ‘withParam’ to create loop logic:

    - - name: generate
        template: gen-number-list
    # Iterate over the list of numbers generated by the generate step above
    - - name: sleep
        template: sleep-n-sec
        arguments:
          parameters:
          - name: seconds
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"

But my YAML also uses ‘when’:

    - - name: generate
        template: gen-number-list
        when: "{{workflow.parameters.NEED_RUN}} == 1"
    # Iterate over the list of numbers generated by the generate step above
    - - name: sleep
        template: sleep-n-sec
        arguments:
          parameters:
          - name: seconds
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"
        when: "{{workflow.parameters.NEED_RUN}} == 1"

When NEED_RUN is 0, Argo reports an error because it can’t resolve {{steps.generate.outputs.result}}. It seems Argo resolves withParam before it evaluates the when condition.
Fortunately we don’t need to modify Argo or Kubernetes to solve this problem. We just need to let the template gen-number-list generate a fake output (an empty array) when NEED_RUN is not 1, which implies the generate step itself has to run even in that case:

      script:
        image: "{{workflow.parameters.BASEIMAGE}}"
        command: [bash]
        source: |
            if [ "$NEED_RUN" -eq 1 ]; then
                python3 -u output.py
            else
                echo "[]"
            fi
        env:
          - name: NEED_RUN
            value: "{{workflow.parameters.NEED_RUN}}"
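
The same fallback could also live inside the generator script itself. Below is a minimal sketch, assuming a hypothetical helper render_items (it is not part of Argo, and the real output.py is not shown in this post) that returns the real list only when NEED_RUN is 1 and an empty JSON list otherwise:

```python
import json
import os


def render_items(need_run, items):
    """Return the JSON text that a withParam loop should consume.

    need_run: value of NEED_RUN as a string.
    items: the list the step would normally produce.
    (Hypothetical helper for illustration; not part of Argo.)
    """
    if need_run == "1":
        return json.dumps(list(items))
    # Fall back to an empty JSON list, the same fake output as echo "[]".
    return "[]"


if __name__ == "__main__":
    # Illustrative values only; the real list comes from your own code.
    print(render_items(os.environ.get("NEED_RUN", "0"), [1, 2, 3]))
```

With this approach the bash wrapper is no longer needed, but the generate step still has to run so that its output exists.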

Image pull policy in Kubernetes

Recently we started using Kubernetes for our project. Yesterday one problem haunted me badly: even after I had pushed a new Docker image to GCR (Google Container Registry), the pod in Kubernetes still used the stale image.
I tried many ways to solve the problem: removing the image from GCR, removing the image from my laptop, and rebuilding the image again and again. Finally I found the reason, and also realised that I am still a beginner with Kubernetes.
The reason the pod uses a stale Docker image is that Kubernetes caches (and should cache, I think) the images it has already pulled, for speed. Hence, if you want to force it to pull the image again, you should use the configuration item imagePullPolicy (ref), like:

spec:
  containers:
    - name: uses-private-image
      image: gcr.io/my-project/my-feature/develop:1.0.1
      imagePullPolicy: Always
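
Why was the cached image reused in the first place? When imagePullPolicy is omitted, Kubernetes chooses a default based on the image reference. The sketch below models that documented defaulting rule in plain Python. default_pull_policy is a hypothetical helper for illustration only, not a Kubernetes API:

```python
def default_pull_policy(image):
    """Model the policy Kubernetes assigns when imagePullPolicy is omitted."""
    # Only the last path segment can carry a tag or digest; this avoids
    # mistaking a registry port (e.g. localhost:5000/app) for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if "@" in last_segment:
        return "IfNotPresent"  # image pinned by digest
    if ":" not in last_segment:
        return "Always"  # no tag is treated like :latest
    tag = last_segment.rsplit(":", 1)[1]
    return "Always" if tag == "latest" else "IfNotPresent"


print(default_pull_policy("gcr.io/my-project/my-feature/develop:1.0.1"))
```

For the image above (tag 1.0.1) the default is IfNotPresent, so pushing a new image under the same tag does not trigger a new pull unless the policy is set to Always.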

Fortunately I can debug my docker image correctly now…

Be careful of the ternary operator in Python

from pathlib import Path
date = "yes"
my_path = Path("hello") / date if date else "no" / "last"
print(my_path)

The result will be:

hello/yes

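Where did the last go? It went with the no. A conditional expression has lower precedence than the / operator, so Python parses the line as (Path("hello") / date) if date else ("no" / "last"). Because date is truthy, the else branch "no" / "last" is never evaluated; if it were, it would raise a TypeError, since a string can't be divided by a string. The correct way is to put the ternary operator in parentheses:

my_path = Path("hello") / (date if date else "no") / "last"

Now the result becomes: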

hello/yes/last
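
To see both parses side by side, here is a small self-contained sketch. It uses PurePosixPath instead of Path only so that the printed separators are the same on every platform; the variable names are illustrative:

```python
from pathlib import PurePosixPath

date = "yes"
last = "last"

# Parsed as: (PurePosixPath("hello") / date) if date else ("no" / last)
unparenthesized = PurePosixPath("hello") / date if date else "no" / last

# Parentheses limit the conditional to the middle path component.
parenthesized = PurePosixPath("hello") / (date if date else "no") / last

print(unparenthesized)  # hello/yes
print(parenthesized)    # hello/yes/last

# With a falsy date the else branch is evaluated, and str / str fails.
date = ""
try:
    PurePosixPath("hello") / date if date else "no" / last
except TypeError as exc:
    print("TypeError:", exc)
```

The last part shows that the unparenthesized version is not only missing a component: it is a latent crash that appears as soon as date is empty.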

Grab a hands-on realtime-object-detection tool

While trying to find a fast object-detection tool on GitHub (by fast I mean detecting in less than 1 second on a mainstream CPU), I experimented with some repositories written in PyTorch (because I am familiar with it). Below are some conclusions:
1. detectron2
This is the official tool from Facebook. I downloaded and installed it successfully. The test Python code is:

import detectron2
from detectron2.utils.logger import setup_logger
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2 import model_zoo
setup_logger()
# import some common libraries
import numpy as np
import cv2
import sys
import time
cfg = get_cfg()
# add project-specific config (e.g., TensorMask) here if you're not running a model in detectron2's core library
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
# Find a model from detectron2's model zoo. You can use the https://dl.fbaipublicfiles... url as well
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.DEVICE = "cpu"
predictor = DefaultPredictor(cfg)
img = cv2.imread(sys.argv[1])
begin = time.time()
outputs = predictor(img)
print("time:", time.time() - begin)
print(outputs)

It can’t recognize all the birds in my test image, and it takes more than 5 seconds on CPU (my MacBook Pro). The performance is not as good as I expected.

2. efficientdet
According to the paper, EfficientDet should be fast and accurate. But the test program I wrote with this repository couldn’t recognize any object at all, so I gave up on this solution.
3. EfficientDet.Pytorch
I couldn’t download the models from its model_zoo.
4. ssd.pytorch
Finally, I came back to my sweet SSD (Single Shot MultiBox Detector). Since I have studied it for more than half a year, I quickly wrote the snippet below:

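import cv2
import numpy as np
import torch
from torch.autograd import Variable
from ssd import build_ssd  # assumes the script runs from the ssd.pytorch repository root
def base_transform(image, size, mean):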
    x = cv2.resize(image, (size, size)).astype(np.float32)
    x -= mean
    x = x.astype(np.float32)
    return x
class BaseTransform:
    def __init__(self, size, mean):
        self.size = size
        self.mean = np.array(mean, dtype=np.float32)
    def __call__(self, image, boxes=None, labels=None):
        return base_transform(image, self.size, self.mean), boxes, labels
def detect(img, net, transform):
    FONT = cv2.FONT_HERSHEY_SIMPLEX
    COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
    height, width = img.shape[:2]
    x = torch.from_numpy(transform(img)[0]).permute(2, 0, 1)
    x = Variable(x.unsqueeze(0))
    y = net(x)  # forward pass
    detections = y.data[0]
    # scale each detection back up to the image
    scale = torch.Tensor([width, height, width, height])
    # class index 3 should be "bird" in the VOC label order used by ssd.pytorch (0 is background)
    for index, loc in enumerate(detections[3]):
        score = loc.numpy()[0]
        if score >= 0.5:
            loc = loc[1:]
            pt = loc * scale
            print(score, pt)
            cv2.rectangle(
                img,
                (int(pt[0]), int(pt[1])),
                (int(pt[2]), int(pt[3])),
                COLORS[index % 3],
                2,
            )
            cv2.putText(
                img,
                str(score),
                (int(pt[0]), int(pt[1])),
                FONT,
                1,
                (255, 255, 255),
                1,
                cv2.LINE_AA,
            )
    return img
img = cv2.imread("bird_matrix.jpg")
net = build_ssd("test", 300, 21)  # initialize SSD
net.load_state_dict(torch.load("ssd300_mAP_77.43_v2.pth", map_location="cpu"))
transform = BaseTransform(net.size, (104 / 256.0, 117 / 256.0, 123 / 256.0))
img = detect(img, net, transform)
cv2.imwrite("result.jpg", img)

The result is not perfect but good enough for my current situation.

Some tips about Argo Workflows (on Kubernetes)

While using Argo to execute workflows last week, I ran into some problems and also found the solutions.
1. Can’t parse “outputs”
By submitting this YAML file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: robin-test-
spec:
  entrypoint: firststep
  templates:
  - name: firststep
    steps:
    - - name: generate
        template: generate-run
    - - name: execution
        template: execution-run
        arguments:
          parameters:
          - name: pair
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"
  - name: generate-run
    container:
      image: gcr.io/robin/feature:latest
      command: ["bash", "-c"]
      args: ["cat my.json"]
  - name: execution-run
    inputs:
      parameters:
      - name: pair
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.pair}}"]

I met the error:

Failed to submit workflow: templates.firststep.steps[1].execution failed to resolve {{steps.generate.outputs.result}}]

Why couldn’t Argo resolve “steps.generate.outputs.result”? Because, at least in the Argo version I used, only a script template (the one with a “source” field) exposes its standard output as “outputs.result”; a container template with “args” doesn’t. So the template “generate-run” should be a script template:

...
  - name: generate-run
    script:
      image: gcr.io/robin/feature:latest
      command: [bash]
      source: |
        cat my.json

2. Can’t parse parameters from JSON
If Argo reports:

withParam value could not be parsed as a JSON list: xxx

it means the “output” of the previous step isn’t valid JSON. So make sure the step prints a well-formed JSON list. For Python, it should look like:

...
  - name: generate-run
    script:
      image: gcr.io/robin/feature:latest
      command: [bash]
      source: |
        # "dictionary" is a placeholder for your own data
        python -c "import json, sys; json.dump([{'id': key, 'name': value} for key, value in dictionary.items()], sys.stdout)"
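
A common way to hit this error from Python is to print a list directly instead of serializing it. The sketch below, with made-up data and a hypothetical checker is_json_list, shows the difference between the two outputs:

```python
import json

pairs = {"a": "alpha", "b": "beta"}  # illustrative data
items = [{"id": key, "name": value} for key, value in pairs.items()]

as_json = json.dumps(items)  # double-quoted strings: valid JSON
as_repr = str(items)         # single-quoted strings: a Python repr, not JSON


def is_json_list(text):
    """Return True only if text parses as JSON and the top level is a list."""
    try:
        return isinstance(json.loads(text), list)
    except json.JSONDecodeError:
        return False


print(as_json, is_json_list(as_json))
print(as_repr, is_json_list(as_repr))
```

Only the json.dumps output passes the check. The output of print(items) looks similar, but its single quotes make it invalid JSON.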

To construct DataFrame more effectively

The old Python code looks like:

import pandas as pd
temp = pd.DataFrame()
for record in table:
    df = pd.DataFrame(record)
    temp = pd.concat([temp, df])
# The final result
result = temp

The snippet above takes 7 seconds to run on my laptop.
Actually, pd.concat() is an expensive operation: each call in the loop copies everything accumulated so far into a new DataFrame. So let’s replace it with a plain Python dictionary:

import pandas as pd
temp = {}
for record in table:
    temp[record[column_name]] = record[column_value]
    ...
# The final result
result = pd.DataFrame.from_dict(temp)

This snippet takes only 0.03 seconds, which is far more efficient.
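
When the records can’t be reduced to a single dictionary (for example, each record really is a small multi-column chunk), a related option is to collect the pieces in a list and call pd.concat() once. This is a sketch under that assumption; the table below is made-up stand-in data, not the data from my project:

```python
import pandas as pd

# Illustrative stand-in for `table`: each record is one small column-wise chunk.
table = [{"id": [i], "value": [i * 2]} for i in range(1000)]

# Build the pieces first, then concatenate once instead of once per record.
frames = [pd.DataFrame(record) for record in table]
result = pd.concat(frames, ignore_index=True)

print(result.shape)
```

The single concat copies each piece only once, which avoids the repeated copying of the loop version.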

Some problems when using GCP

After I launched a Compute Engine instance with a container, it reported an error:

gcr.io/xx/xx-xx/feature:yy
Feb 03 00:12:28 xx-d19b201 konlet-startup[4664]: {"errorDetail":{"message":"failed to register layer: Error processing tar file(exit status 1): write /xxx/2020-01-16/base_cmd/part-00191-2e99af0e-1615-42af-9c60-910f9a9e6a17-c000.snappy.parquet: no space left on device"},"error":"failed to register layer: Error processing tar file(exit status 1): write /xxx/2020-01-16/base_cmd/part-00191-2e99af0e-1615-42af-9c60-910f9a9e6a17-c000.snappy.parquet: no space left on device"}

The key is “no space left on device”. Then I used df to check the disk space:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root       1.2G  879M  343M  72% /
devtmpfs         26G     0   26G   0% /dev
tmpfs            26G     0   26G   0% /dev/shm
tmpfs            26G  436K   26G   1% /run
tmpfs            26G     0   26G   0% /sys/fs/cgroup
tmpfs           1.0M  108K  916K  11% /etc/machine-id
tmpfs            26G     0   26G   0% /tmp
tmpfs           256K     0  256K   0% /mnt/disks
overlayfs       1.0M  108K  916K  11% /etc
/dev/sda8        12M   28K   12M   1% /usr/share/oem
/dev/sda1       5.7G  5.7G     0 100% /mnt/stateful_partition
tmpfs           1.0M  4.0K 1020K   1% /var/lib/cloud

Obviously the space on /mnt/stateful_partition has been used up. The solution is simple: add a new argument to the gcloud command to enlarge the boot disk:

gcloud compute instances create-with-container [INSTANCE_NAME] \
     --container-image [DOCKER_IMAGE] \
     --boot-disk-size=30GB

Another problem occurred when I was trying to launch a Cloud Run service. It reported a mess:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 98, in refresh
    request, service_account=self._service_account_email
  File "/usr/local/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 241, in get_service_account_token
    request, "instance/service-accounts/{0}/token".format(service_account)
  File "/usr/local/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 172, in get
    response,
google.auth.exceptions.TransportError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/564585695625-compute@developer.gserviceaccount.com/token from the Google Compute Enginemetadata service. Status: 500 Response:\nb'Could not fetch URI /computeMetadata/v1/instance/service-accounts/564585695625-compute@developer.gserviceaccount.com/token\\n'", )

Actually, the reason is quite simple: I hadn’t realized that Cloud Run requires the container to listen on the port given by the PORT environment variable. Otherwise, the service will not start successfully.
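
As an illustration, here is a minimal sketch of a service that honors PORT, using only the Python standard library. The handler, the helper make_server and the fallback port 8080 are my own assumptions for this example, not code from my real service:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every GET with a tiny plain-text body.
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=None):
    """Bind to 0.0.0.0 on the given port, or on $PORT (8080 if unset)."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return HTTPServer(("0.0.0.0", port), HelloHandler)
```

In the container entrypoint you would call make_server().serve_forever(). The important part is reading the port from the environment instead of hard-coding it.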

Problem about installing Kubeflow

I tried to install Kubeflow by following this guide. But when I ran

kfctl apply -V -f https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml

it reported

INFO[0000] Downloading https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml to /var/folders/yr/z2fx0f3j0ns567817_vwdwknv3w9vm/T/429156444/tmp.yaml  filename="utils/k8utils.go:169"
INFO[0000] Downloading https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml to /var/folders/yr/z2fx0f3j0ns567817_vwdwknv3w9vm/T/638207599/tmp_app.yaml  filename="configconverters/converters.go:71"
Error: failed to build kfApp from URI https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml:  (kubeflow.error): Code 400 with message: current directory /Users/1103308/Downloads/kf/robin-kf not empty, please switch directories
Usage:
  kfctl apply -f ${CONFIG} [flags]
Flags:
  -f, --file string   Static config file to use. Can be either a local path:
                      		export CONFIG=./kfctl_gcp_iap.yaml
                      	or a URL:
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_gcp_iap.0.7.0.yaml
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_existing_arrikto.0.7.0.yaml
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_aws.0.7.0.yaml
                      		export CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.0.yaml
                      	kfctl apply -V --file=${CONFIG}
  -h, --help          help for apply
  -V, --verbose       verbose output default is false
failed to build kfApp from URI https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml:  (kubeflow.error): Code 400 with message: current directory /Users/1103308/Downloads/kf/robin-kf not empty, please switch directories

It cost me some time to find the solution, so let me keep it short:

1. Download the file https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml and find these lines near its bottom:

  - name: manifests
    uri: https://github.com/kubeflow/manifests/archive/v0.7-branch.tar.gz

2. Download https://github.com/kubeflow/manifests/archive/v0.7-branch.tar.gz and untar it; there will be a new directory “manifests-0.7-branch”.
3. Change the “uri:” in the downloaded kfctl_k8s_istio.0.7.1.yaml to “uri: /full/path/manifests-0.7-branch”.

Now, with CONFIG_URI pointing at the edited local file, we can run kfctl apply -V -f ${CONFIG_URI} successfully.
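
If you prefer to script the edit of the “uri:” line rather than do it by hand, a plain text replacement is enough. This is a minimal sketch; use_local_manifests is a hypothetical helper, and /full/path is the same placeholder as above:

```python
def use_local_manifests(config_text, remote_uri, local_dir):
    """Return config_text with the manifests uri pointing at local_dir."""
    old_line = "uri: " + remote_uri
    if old_line not in config_text:
        raise ValueError("uri line not found: " + old_line)
    return config_text.replace(old_line, "uri: " + local_dir)


# The two lines quoted above, used here as sample input.
sample = (
    "  - name: manifests\n"
    "    uri: https://github.com/kubeflow/manifests/archive/v0.7-branch.tar.gz\n"
)
edited = use_local_manifests(
    sample,
    "https://github.com/kubeflow/manifests/archive/v0.7-branch.tar.gz",
    "/full/path/manifests-0.7-branch",
)
print(edited)
```

Reading the downloaded file, passing its text through this function and writing it back gives the same result as the manual edit.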
It seems that although Kubeflow has been in development for almost two years, some basic problems still exist in it. That is a little disappointing to me.

Directly deploy containers on GCP VM instance

We can deploy containers directly to a VM instance of Google Compute Engine, instead of launching a heavy Kubernetes cluster. The command looks like:

gcloud compute instances create-with-container $(VM_NAME) \
    --container-image=$(IMAGE_NAME) \
    --machine-type n1-standard-1 \
    --labels team=mle,product=decision-engine \
    --zone us-east1-a

To add environment variables to this container, we just need to add an argument:

    --container-env-file env.list

To let the container run a command for us, we need to add command arguments:

    --container-command "/bin/bash" \
    --container-arg="-c" \
    --container-arg="make all; bash next.sh"

There is still a problem: the VM instance will run this container again and again, even when the task in the container has finished successfully.
To solve this, we just need to add another argument:

    --container-restart-policy on-failure
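
Putting the pieces together, the full command can be assembled programmatically. The sketch below only combines the flags already shown in this post; build_create_command is a hypothetical helper, and the VM and image names are made up:

```python
import shlex


def build_create_command(vm_name, image, env_file=None, command=None,
                         container_args=(), restart_policy=None):
    """Assemble the gcloud argument list from the flags used in this post."""
    cmd = ["gcloud", "compute", "instances", "create-with-container", vm_name,
           "--container-image=" + image]
    if env_file:
        cmd += ["--container-env-file", env_file]
    if command:
        cmd += ["--container-command", command]
    for arg in container_args:
        cmd.append("--container-arg=" + arg)
    if restart_policy:
        cmd += ["--container-restart-policy", restart_policy]
    return cmd


cmd = build_create_command(
    "my-vm",                              # illustrative VM name
    "gcr.io/my-project/my-image:1.0.1",   # illustrative image
    env_file="env.list",
    command="/bin/bash",
    container_args=["-c", "make all; bash next.sh"],
    restart_policy="on-failure",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run() without any extra quoting.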

How to ignore illegal sample of dataset in PyTorch?

I have implemented a dataset class for my image samples. But it can’t handle the situation where a corrupted image is read:

import torch.utils.data as data
class MyDataset(data.Dataset):
  ...
  def __getitem__(self, index):
    image = cv2.imread(image_list[index])
    if image is None:
      # What should we do?
...

The correct solution is in the PyTorch forum. Therefore I changed my code:

class MyDataset(data.Dataset):
  ...
  def __getitem__(self, index):
    image = cv2.imread(image_list[index])
    if image is None:
      return None
    # Other preprocessing
    ...
def my_collate(batch):
    batch = filter(lambda img: img is not None, batch)
    return data.dataloader.default_collate(batch)
dataset = MyDataset()
loader = data.DataLoader(dataset, collate_fn=my_collate)

But it reports:

Loading data exception: Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 108, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "train.py", line 197, in my_collate
    return data.dataloader.default_collate(batch)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 34, in default_collate
    elem_type = type(batch[0])
TypeError: 'filter' object is not subscriptable

It seems default_collate() can’t index into the ‘filter’ object: in Python 3, filter() returns a lazy iterator, not a list. Don’t worry. We just need to wrap it with a small function, list():

def my_collate(batch):
  ...
  return data.dataloader.default_collate(list(batch))
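
One more case is worth guarding against: if every sample in a batch is corrupted, the filtered list is empty and the collate function has nothing to work with. Below is a small sketch of a reusable wrapper. skip_none_collate is a hypothetical helper of mine, and returning None for an all-bad batch is just one possible convention:

```python
def skip_none_collate(base_collate):
    """Wrap a collate function so that samples equal to None are dropped."""
    def collate(batch):
        kept = [sample for sample in batch if sample is not None]
        if not kept:
            # Every sample in this batch was unreadable; the caller must skip it.
            return None
        return base_collate(kept)
    return collate


# Stand-in collate function for demonstration; with PyTorch you would pass
# data.dataloader.default_collate instead.
collate = skip_none_collate(lambda samples: samples)
print(collate([1, None, 2]))
print(collate([None, None]))
```

With the DataLoader this becomes collate_fn=skip_none_collate(data.dataloader.default_collate), and the training loop should skip any batch that comes back as None.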