Robin on Linux – Page 12 – All about technology

Strange time output in a container of Kubernetes cluster

After running a workflow in Argo, I found that the output of the “date” command was totally wrong:

# date
Wed Mar 3 00:41:27 2021
# TZ='America/Los_Angeles' date
Wed Mar 3 00:41:36 2021
# TZ='America/New_York' date
Wed Mar 3 00:41:38 2021
# TZ='Australia/Sydney' date
Wed Mar 3 00:42:01 2021

No matter which time zone I used, the time printed by the “date” command didn’t change at all.

This only happened in the containers of our k8s cluster. On my personal VM, it worked just fine.

Finally, my colleague Tianchu found the reason: the Docker image used by the container in the k8s cluster didn’t have the time zone data installed at all! The solution is simple: just install the time zone data:

DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata
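A quick way to confirm the diagnosis from inside a container is to ask Python whether it can load a zone. This is only a minimal sketch, assuming Python 3.9+ is available in the image; `timezone_available` is a hypothetical helper name, not part of any library. Note that Python can also find zones through the `tzdata` package from PyPI, so a `True` here does not prove the system package is installed.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def timezone_available(name: str) -> bool:
    """Return True if the time zone entry `name` can be loaded."""
    try:
        ZoneInfo(name)
        return True
    except ZoneInfoNotFoundError:
        return False


if __name__ == "__main__":
    for zone in ("America/Los_Angeles", "America/New_York", "Australia/Sydney"):
        # False for every zone suggests the image has no time zone data
        print(zone, timezone_available(zone))
```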

Some thoughts about cuDF and cuML

I just received an email from NVIDIA about their RAPIDS. Although cuDF and cuML look fantastic for a data scientist, I am still doubtful about them.

In our daily work, we usually process small DataFrames with Pandas, so cuDF would be too expensive since it needs a GPU. And even when we need to join two large DataFrames, we tend to use BigQuery, because it’s distributed and relatively cheap. The only proper use case for cuDF, I think, is running heavy operations on less than 8GB of data. Who needs so many heavy operations on a DataFrame? I don’t know.

As for cuML, it’s more like a GPU version of scikit-learn. Actually, for tabular data we use XGBoost/LightGBM, and for unstructured data we use PyTorch/TensorFlow. Who would even use scikit-learn, not to mention cuML?

To put Back-Quote in a string of Bash

It’s very simple to print the word “hello” in Bash:

echo "hello"

But how do you print a word wrapped in backquotes?

echo "`hello`"
# Bash reports an error because it tries to run 'hello' as a command
bash: hello: command not found

There are two common solutions to this problem. One is to escape each backquote with a backslash:

echo "\`hello\`"

The other is to wrap the words in single quotes:

echo '`hello`'
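When the command string is built by a script instead of typed by hand, the same single-quote trick can be applied automatically. A small sketch, assuming the command is generated from Python; `shlex.quote` comes from the standard library:

```python
import shlex

word = "`hello`"
# shlex.quote wraps the text in single quotes, so Bash will not run it
command = "echo " + shlex.quote(word)
print(command)
```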

How to gracefully end a PySpark application

This article recommends using “return” to jump out of a PySpark application. But when I followed its advice, Python reported an error:

  File "test.py", line 333
    return
    ^
SyntaxError: 'return' outside function

It seems this can’t work, because “return” is only valid inside a function. After trying to run a PySpark application on my own laptop, I finally got the correct answer:

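import sys

# df is the DataFrame the job was about to process
if df.rdd.isEmpty():
    # Nothing to do: stop the application with a success exit code
    sys.exit(0)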
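“return” does become usable once the job body lives inside a function, which may be what the article had in mind. A minimal sketch of that layout; `main` and `records` are hypothetical stand-ins (a plain list replaces the real DataFrame so the example runs without Spark):

```python
def main(records) -> str:
    """Hypothetical job body; `records` stands in for the real DataFrame."""
    if not records:
        # Inside a function, an early return is legal
        return "skipped"
    print(f"processing {len(records)} records")
    return "done"


# The script ends normally once main() returns
status = main([])
print(status)
```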

An old bug about PyArrow

To save memory in my program that uses Pandas, I changed the types of some columns from string to category, following the reference.

# Convert the low-cardinality string columns to the category dtype
cols = ["os_type", "cpu_type", "chip_brand"]
df[cols] = df[cols].astype("category")

It saved at least half of the memory in my case. But when I used pyarrow to store the DataFrame as Parquet

df.to_parquet("my.parquet")

it reported an error:

Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647

It’s a bug in old versions of pyarrow that was fixed in Sep 2019. I then upgraded from pyarrow-0.12.1 to pyarrow-0.17.1, which fixed this error.

But the story doesn’t end here.

With pyarrow-0.12.1, the snippet below returns an object of type <pyarrow.lib.Column>

import pyarrow.parquet as pq
table = pq.read_table(path)
table.column(0)

and this object also has a “Column name” attribute.

But with pyarrow-0.17.1, the same code returns an object of type <pyarrow.lib.ChunkedArray>, which doesn’t have a “Column name”.

This difference can make some code fail (it actually broke our program). So beware: after you upgrade pyarrow (or any other Python library), run the tests to make sure all the legacy code still works properly.

A stupid mistake in the new deep learning experiment

After my old colleague JianMei prepared about 1TB of bird sound recordings (every mp3 file is converted to a spectrogram image and split into chunks of 2.5 seconds each, so every sample ends up as a 1250×78 array), I started training with almost the same code I had used for bird image classification.

The train-accuracy rose very slowly, so I added a line of code to normalize every input sample:

image = (image - image.mean()) / image.std()

After that, the train-accuracy rose faster, but the eval-accuracy was still quite low.

In order to find out the root of the problem, I started to train from only two classes: “Black-capped Donacobius” and “Blue-eared Barbet”

Eval accuracy:0.965278 | Train accuracy:1.000000

The result seemed pretty good, so I increased the number of classes to 20:

Eval accuracy:0.211102 | Train accuracy:0.840278
Eval accuracy:0.203834 | Train accuracy:0.885417
Eval accuracy:0.191514 | Train accuracy:0.904247
Eval accuracy:0.193245 | Train accuracy:0.916667
Eval accuracy:0.210894 | Train accuracy:0.916667
Eval accuracy:0.190269 | Train accuracy:0.932292

Are some types of birds hard for a deep learning model to generalize? I began to consider how to find these “hard to train” bird types: maybe start from 2 classes and increase the number of classes step by step, and then draw a few curves of the train-accuracy and eval-accuracy…

Suddenly I realized that I had applied normalization only to the training samples, not to the evaluation samples!

What a stupid mistake. It cost me a whole stuffy afternoon for nothing. I really should remember this lesson: whatever you do to the training samples, do it to the evaluation samples as well, except dropout.
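A simple guard against this kind of mismatch is to put the preprocessing into one function and call it from both the training and the evaluation path. A minimal sketch with NumPy; `normalize` is a hypothetical helper, and the random arrays only stand in for real 1250×78 spectrogram chunks:

```python
import numpy as np


def normalize(image: np.ndarray) -> np.ndarray:
    """Per-sample normalization, the same formula as in the training code."""
    return (image - image.mean()) / image.std()


rng = np.random.default_rng(0)
# Random arrays standing in for 1250x78 spectrogram chunks
train_sample = rng.random((1250, 78))
eval_sample = rng.random((1250, 78))

# Both splits go through the same function
train_input = normalize(train_sample)
eval_input = normalize(eval_sample)
```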

A few notes for Pandas and BigQuery

1. Get the memory size of a Pandas DataFrame

df.memory_usage(deep=True).sum()
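The same call is handy for checking how much a dtype change actually saves, such as the string-to-category conversion. A small sketch on synthetic data (the column name and values are made up for illustration):

```python
import pandas as pd

# Synthetic column with few distinct values repeated many times
df = pd.DataFrame({"os_type": ["linux", "windows", "macos"] * 10_000})

before = df.memory_usage(deep=True).sum()
df["os_type"] = df["os_type"].astype("category")
after = df.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```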

2. Upload a large DataFrame of Pandas to BigQuery table

If your DataFrame is too big, the upload will fail with “UDF out of memory”:

google.api_core.exceptions.BadRequest: 400 Resources exceeded during query execution: UDF out of memory.; Failed to read Parquet file [...]. This might happen if the file contains a row that is too large, or if the total size of the pages loaded for the queried columns is too large.

The solution is as simple as splitting the DataFrame and uploading the chunks one by one:

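import numpy as np
from google.cloud import bigquery

# df and table_id are assumed to be defined earlier
client = bigquery.Client()
for df_chunk in np.array_split(df, 10):
    job_config = bigquery.LoadJobConfig()
    # Append each chunk instead of overwriting the table
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job = client.load_table_from_dataframe(df_chunk, table_id, job_config=job_config)
    job.result()  # wait for this chunk to finish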
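The split itself does not have to go through NumPy; plain row slicing works as well, which may be preferable since recent pandas versions may warn when np.array_split is applied to a DataFrame. A sketch with a hypothetical helper `split_dataframe`; each chunk it yields could be passed to load_table_from_dataframe exactly as in the loop above:

```python
import pandas as pd


def split_dataframe(df: pd.DataFrame, n_chunks: int):
    """Yield at most n_chunks consecutive row slices of df."""
    # Ceiling division, with a floor of 1 so an empty frame is handled
    chunk_size = max(1, -(-len(df) // n_chunks))
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


df = pd.DataFrame({"value": range(25)})
chunks = list(split_dataframe(df, 10))
print([len(chunk) for chunk in chunks])
```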

3. Restore table in BigQuery

How do you recover a deleted table in BigQuery? Just use the bq command:

bq cp dataset.table@1577833205000 dataset.new_table

If your <timestamp> is not correct, the bq command will tell you which <timestamp> is valid for this table. Then you can run the command again with that <timestamp>.
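The number after the @ is a point in time expressed in milliseconds since the Unix epoch. A small sketch for computing it from a readable date; `snapshot_millis` is a hypothetical helper, and the chosen moment simply reproduces the value used in the command above:

```python
from datetime import datetime, timezone


def snapshot_millis(moment: datetime) -> int:
    """Milliseconds since the Unix epoch for a timezone-aware datetime."""
    return int(moment.timestamp() * 1000)


moment = datetime(2019, 12, 31, 23, 0, 5, tzinfo=timezone.utc)
print(f"bq cp dataset.table@{snapshot_millis(moment)} dataset.new_table")
```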

Import date column in Pandas to BigQuery

Imagine we have a small CSV file:

name,enroll_time
robin,2021-01-15 09:50:33
tony,2021-01-14 01:50:33
jaime,2021-01-13 00:50:33
tyrion,2021-2-15 13:22:17
bran,2022-3-16 14:00:01

Let’s try to load it into a Pandas DataFrame and upload it to a BigQuery table:

import pandas as pd
from google.cloud import bigquery
df = pd.read_csv("test.csv", parse_dates=["enroll_time"], index_col=0)
schema = []
schema.append(bigquery.SchemaField("name", "STRING"))
schema.append(bigquery.SchemaField("enroll_time", "DATE"))
job_config = bigquery.LoadJobConfig(schema=schema)
bq_client = bigquery.Client()
table = "project.dataset.test_table"
job = bq_client.load_table_from_dataframe(
    df, table, job_config=job_config
)
job.result()

But it reports an error:

  File "pyarrow/array.pxi", line 176, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to date32[day] would lose data: 1610704233000000000

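At first it seemed that the BigQuery library couldn’t recognize 1610704233000000000 as nanoseconds, so I tried dividing it by 1e9, but that also failed. The real cause is in the message itself: the value is 2021-01-15 09:50:33 expressed in nanoseconds, and casting it to a day-resolution date would throw away the time of day, so pyarrow refuses.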

What we actually need to do is use TIMESTAMP instead of DATE as the type of the enroll_time column:

schema.append(bigquery.SchemaField("name", "STRING"))
schema.append(bigquery.SchemaField("enroll_time", "TIMESTAMP"))

With this schema, the BigQuery library accepts the column even though its values are in nanoseconds.
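If the table really should hold a DATE, another option is to drop the time of day before uploading, so that nothing is lost in the cast. A sketch of that conversion on the first rows of the same CSV; the `enroll_date` column is a made-up name, and whether the upload then accepts a DATE schema for it is an assumption to verify, not something confirmed here:

```python
import io

import pandas as pd

csv_text = """name,enroll_time
robin,2021-01-15 09:50:33
tony,2021-01-14 01:50:33
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["enroll_time"])
# Keep only the calendar date; the column now holds datetime.date objects
df["enroll_date"] = df["enroll_time"].dt.date
print(df[["name", "enroll_date"]])
```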

To solve the problem about pivot() of Pandas

Below is an example of pivot() from the official pandas documentation:

import pandas as pd
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
result = df.pivot(index='foo', columns='bar', values='baz')
print(result)

The result will be:

bar  A  B  C
foo
one  1  2  3
two  4  5  6

But if the DataFrame contains duplicate rows, pivot() reports an error:

ValueError: Index contains duplicate entries, cannot reshape

To fix this, we can just drop_duplicates() first and then pivot():

result = df.drop_duplicates().pivot(index='foo', columns='bar', values='baz')

As a matter of fact, there are situations that drop_duplicates() can’t fix, for example when two rows share the same foo and bar but differ in baz:

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'A', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

Now we will need to use groupby() and unstack() to replace pivot():

result = (df.groupby(["foo", "bar"])
    .baz
    .first()
    .unstack()
)

And the result is

bar    A    B    C
foo
one  1.0  NaN  3.0
two  4.0  5.0  6.0
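Note that first() keeps baz=1 for the (“one”, “A”) pair and silently drops baz=2. The same reshaping can also be written with pivot_table(), which makes the choice of aggregation explicit. A sketch on the same data; swapping 'first' for another aggregation such as 'sum' or 'max' is up to what the duplicates mean in your case:

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'A', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

# aggfunc decides what happens to the two ("one", "A") rows
result = df.pivot_table(index='foo', columns='bar', values='baz', aggfunc='first')
print(result)
```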

Books I read in the year 2020

Since starting my new job on 6th January, I have been learning Kubernetes for the first time. Frankly speaking, Kubernetes is very powerful and also easy to use (from my perspective). And “Kubernetes in Action” is a really good book for beginners.

Having bought seven books of “A Song of Ice and Fire” for 35 dollars on eBay last year, I finally started to read this epic novel. The world built by Martin is so cruel, so cold, and so attractive. “A Clash of Kings” may be the thickest single book I can remember reading.

To learn more about hardware, I went through the fifth edition of “Computer Architecture”. I skipped a lot of paragraphs and only got a preliminary view of modern CPU architecture. I hope to have the opportunity to read it more carefully in the future.

Despite being written by the same author, Wolfgang Faust, “Tiger Tracks” is not as evocative as “The Last Panther”, which is the best war literature I have read to date.

“Ensemble Methods” is written by Zhi-Hua Zhou, a Chinese professor at Nanjing University. I love his books because they are very easy to understand and also give me a lot of new concepts and knowledge. Though this book is not very practical for my work, I still recommend it for broadening our horizons.

Last but not least, I’d like to recommend two extra papers that I found very interesting and inspirational this year, one for software and one for hardware.

Software: Dota 2 with Large Scale Deep Reinforcement Learning

Hardware: Fast Stencil-Code Computation on a Wafer-Scale Processor