Constructing a DataFrame more efficiently

The old Python code looks like this:

The snippet above takes 7 seconds to run on my laptop.
In fact, pd.concat() is an expensive operation for the CPU, because every call copies the existing data into a new DataFrame. So let’s replace it with a plain Python dictionary:

This snippet takes only 0.03 seconds, which is far more efficient.
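
As a rough illustration of the two approaches, here is a minimal sketch. The column names ("id", "square") and the helper names are made up for the example, so it is not the exact code that was timed above:

```python
import time

import pandas as pd


def build_with_concat(n_rows):
    """Slow pattern: grow the DataFrame one row at a time with pd.concat()."""
    df = None
    for i in range(n_rows):
        row = pd.DataFrame({"id": [i], "square": [i * i]})
        # Every concat copies all rows collected so far into a new DataFrame.
        df = row if df is None else pd.concat([df, row], ignore_index=True)
    return df


def build_with_dict(n_rows):
    """Fast pattern: collect plain Python lists, build the DataFrame once."""
    columns = {"id": [], "square": []}
    for i in range(n_rows):
        columns["id"].append(i)
        columns["square"].append(i * i)
    return pd.DataFrame(columns)


def compare(n_rows):
    """Return (seconds_for_concat, seconds_for_dict) for the same table."""
    start = time.perf_counter()
    build_with_concat(n_rows)
    middle = time.perf_counter()
    build_with_dict(n_rows)
    end = time.perf_counter()
    return middle - start, end - middle
```

Calling compare(2000) should already show a clear gap between the two timings, although the exact numbers depend on the machine and the pandas version.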

Some problems when using GCP

After I launched a Compute Engine instance with a container, it reported an error:

gcr.io/xx/xx-xx/feature:yy
Feb 03 00:12:28 xx-d19b201 konlet-startup[4664]: {“errorDetail”:{“message”:”failed to register layer: Error processing tar file(exit status 1): write /xxx/2020-01-16/base_cmd/part-00191-2e99af0e-1615-42af-9c60-910f9a9e6a17-c000.snappy.parquet: no space left on device”},”error”:”failed to register layer: Error processing tar file(exit status 1): write /xxx/2020-01-16/base_cmd/part-00191-2e99af0e-1615-42af-9c60-910f9a9e6a17-c000.snappy.parquet: no space left on device”}

The key part is “no space left on device”. Then I used df to check the disk space:

Obviously, the space on /mnt/stateful_partition had been used up. The solution is simple: add a new argument to the gcloud command
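
For reference, the same kind of disk check that df gives can be done from Python with the standard library. This is only a sketch: the helper name disk_report is made up, and the default path "/" is an assumption (on the instance above the interesting path was /mnt/stateful_partition):

```python
import shutil


def disk_report(path="/"):
    """Return total/used/free space for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "path": path,
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
        # Guard against a zero-sized filesystem to avoid division by zero.
        "used_percent": 100.0 * usage.used / usage.total if usage.total else 0.0,
    }
```

A used_percent close to 100 for the partition that stores the image layers would match the “no space left on device” error.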

Another problem occurred when I tried to launch an instance on Cloud Run. It reported a mess:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 98, in refresh
    request, service_account=self._service_account_email
  File "/usr/local/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 241, in get_service_account_token
    request, "instance/service-accounts/{0}/token".format(service_account)
  File "/usr/local/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 172, in get
    response,
google.auth.exceptions.TransportError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/564585695625-compute@developer.gserviceaccount.com/token from the Google Compute Enginemetadata service. Status: 500 Response:\nb'Could not fetch URI /computeMetadata/v1/instance/service-accounts/564585695625-compute@developer.gserviceaccount.com/token\\n'", )

Actually, the reason is quite simple: I hadn’t realized that Cloud Run requires the container to listen on the port given by the PORT environment variable. Otherwise, the service will not start successfully.
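
A minimal sketch of a container entry point that respects PORT could look like the following. It uses only the standard library; the fallback port 8080 and the names get_port, Handler and serve are assumptions for the example, not the code of my real service:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def get_port(default=8080):
    """Read the port from the PORT environment variable (fallback is assumed)."""
    return int(os.environ.get("PORT", default))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every GET request with a tiny plain-text body.
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve():
    """Listen on all interfaces so the platform can reach the container."""
    server = HTTPServer(("0.0.0.0", get_port()), Handler)
    server.serve_forever()
```

Calling serve() at the end of the script starts the server. The important part is only that the port comes from the environment instead of being hard-coded.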

A problem with installing Kubeflow

I tried to install Kubeflow by following this guide. But when I ran

it reported

It cost me some time to find the solution, so let’s keep it short:

  1. Download the file https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml, and look at its last few lines:
  2. Download https://github.com/kubeflow/manifests/archive/v0.7-branch.tar.gz and untar it. This creates a new directory “manifests-0.7-branch”.
  3. Change the “uri:” line in kfctl_k8s_istio.0.7.1.yaml to “uri: /full/path/manifests-0.7-branch”
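
Step 3 can also be scripted. The sketch below assumes the file contains a line of the form “uri: <some URL>” (the SAMPLE text is my assumption about the shape of those bottom lines, not a verbatim copy), and the function name is made up:

```python
import re

# Assumed shape of the relevant lines in the YAML file (illustrative only).
SAMPLE = """\
  repos:
  - name: manifests
    uri: https://github.com/kubeflow/manifests/archive/v0.7-branch.tar.gz
  version: v0.7-branch
"""


def point_uri_to_local(yaml_text, local_path):
    """Replace the value of every 'uri:' line with local_path, keeping indentation."""
    pattern = re.compile(r"^([ \t]*uri:[ \t]*).*$", flags=re.MULTILINE)
    # A function is used as the replacement so backslashes in the path stay literal.
    return pattern.sub(lambda match: match.group(1) + local_path, yaml_text)


patched = point_uri_to_local(SAMPLE, "/full/path/manifests-0.7-branch")
```

To apply it to the real file, read kfctl_k8s_istio.0.7.1.yaml into a string, pass it through point_uri_to_local(), and write the result back.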

Now we can run kfctl apply -V -f ${CONFIG_URI} successfully.
It seems that although Kubeflow has been in development for almost two years, it still has some basic problems. That is a little disappointing to me.

Deploying containers directly on a GCP VM instance

We can deploy containers directly onto a VM instance of Google Compute Engine instead of launching a heavy Kubernetes cluster. The command looks like:

To add environment variables to this container, we just need to add an argument:

To make the container run a command for us, we need to add command arguments:

There is still a problem: the VM instance will run this container again and again, even if the task in the container finished successfully.
To solve this, we just need to add another argument:
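
To keep the pieces together, here is a small sketch that assembles such a command line in Python. The flags are the ones I understand gcloud compute instances create-with-container to accept (--container-image, --container-env, --container-command, --container-arg, --container-restart-policy); the instance name, image, and values are hypothetical, and the original commands of this post are not reproduced here:

```python
def build_create_command(name, image, env=None, command=None, args=(),
                         restart_policy="never"):
    """Assemble the argv list for creating a VM that runs one container."""
    cmd = ["gcloud", "compute", "instances", "create-with-container", name,
           f"--container-image={image}"]
    if env:
        # Environment variables are passed as one KEY=VALUE,KEY=VALUE list.
        pairs = ",".join(f"{key}={value}" for key, value in env.items())
        cmd.append(f"--container-env={pairs}")
    if command:
        cmd.append(f"--container-command={command}")
    for arg in args:
        # Each argument of the container command gets its own flag.
        cmd.append(f"--container-arg={arg}")
    # "never" stops the VM from restarting the container after it exits.
    cmd.append(f"--container-restart-policy={restart_policy}")
    return cmd


example = build_create_command(
    "my-vm",                      # hypothetical instance name
    "gcr.io/my-project/job:v1",   # hypothetical image
    env={"STAGE": "test"},
    command="python",
    args=["main.py"],
)
```

The resulting list can be passed to subprocess.run(example, check=True) on a machine where gcloud is installed and authenticated.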

How to skip corrupted samples of a dataset in PyTorch?

I have implemented a dataset class for my image samples, but it can’t handle the case where a corrupted image is read:

The correct solution is in the PyTorch Forum. Following it, I changed my code:

But it reported:

It seems default_collate() couldn’t recognize the ‘filter’ object, because filter() returns a lazy iterator in Python 3 rather than a list. Don’t worry. We just need to wrap it in a small function: list()
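
The idea can be sketched without PyTorch itself. The example assumes the dataset returns None for a corrupted image; make_safe_collate and stack_fields are made-up names, and stack_fields is only a stand-in for default_collate so the snippet runs on its own:

```python
def make_safe_collate(base_collate):
    """Wrap a collate function so that None samples are dropped first."""
    def safe_collate(batch):
        # list() is the important part: base_collate needs a real sequence,
        # not the lazy iterator that filter() returns.
        kept = list(filter(lambda sample: sample is not None, batch))
        return base_collate(kept)
    return safe_collate


def stack_fields(batch):
    """Stand-in for default_collate: [(x, y), ...] -> ([x, ...], [y, ...])."""
    return tuple(list(field) for field in zip(*batch))


collate = make_safe_collate(stack_fields)
batch = collate([(1, "cat"), None, (2, "dog")])
```

With PyTorch installed, the wrapper would be built from default_collate instead of stack_fields and passed to the DataLoader through its collate_fn argument.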

Tips about pytest

1. Error for “fixture ‘mocker’ not found”
After running pytest, it reported:

The solution is simply to install the missing pip package, pytest-mock, which provides the mocker fixture:

2. How to make sure a function has been called without caring about its arguments?
There are two methods. The first is to use “.called”:

The second is to use “mocker.spy()”:
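
Both ideas can be tried with the standard library’s unittest.mock, which pytest-mock wraps. This is a sketch with made-up names (Mailer, notify); the second function uses wraps= as the closest stdlib equivalent of a spy, it is not mocker.spy() itself:

```python
from unittest import mock


class Mailer:
    def send(self, to, body):
        return f"sent to {to}"


def notify(mailer):
    return mailer.send("someone@example.com", body="hello")


def check_called_flag():
    """Method 1: replace the method and only look at `.called`."""
    mailer = Mailer()
    with mock.patch.object(mailer, "send") as fake_send:
        notify(mailer)
    return fake_send.called


def check_spy():
    """Method 2: wrap the real method so it still runs, then count calls."""
    mailer = Mailer()
    with mock.patch.object(mailer, "send", wraps=mailer.send) as spy:
        result = notify(mailer)
    return spy.call_count, result
```

Inside a pytest test with pytest-mock, the same checks would go through the mocker fixture, for example mocker.patch.object(...) and mocker.spy(mailer, "send").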

Books I read in year 2019

At the beginning of 2019, I finished the book “The Great Siege: Malta 1565”. The story of a few loyal knights protecting Europe from the Ottoman Empire is so extraordinary that it encouraged me to keep learning and working in information technology.
To find a new job as a Data Engineer or Data Scientist, I nearly memorized the whole book “Hundreds of interviews about machine learning” (title translated from Chinese). Although I didn’t find a machine learning job (actually, it’s a job about just damned PHP and Javascript), this book gave me confidence and direction before I started looking for a new job.
I bought the book “Rats of NIMH” at the end of 2016 and finished reading it more than two years later. In that period, life changed tremendously for me, and I hope it ends as well as it did for the Frisby family.

The most exciting new thing I learned was NLP in deep learning. After reading the papers about Word2Vec, Transformer, ELMo, BERT, etc., I became very familiar with and interested in NLP.
After starting my new job in June 2019, I read the book “Statistical Machine Learning” (title translated from Chinese) on the commuter bus. The bus shook a lot, so I had to read for a while, rest my eyes, and repeat. Life is not easy, so I should keep persevering.

The speed of generating random numbers in Python 3

I just want to generate a random number in a range (float or integer) using Python. Since my code only needs one random number at a time, the speed of each call to the generating function is critical.

So let’s do the experiment:

The result is:

It looks like random.uniform() from the Python 3 standard library is the fastest one. But there is still an odd phenomenon: numpy is not as fast as we expected.
Actually, the correct way to use numpy.random.uniform() is to set its size argument.

The result is:

Thus the best way to generate a bunch of random numbers at once is numpy.random.uniform() with its size argument.
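
A sketch of such a comparison is below. The range [0, 1), the helper names, and the use of timeit are my choices for the illustration, and no timings are quoted because they vary from machine to machine:

```python
import random
import timeit

import numpy as np


def time_stdlib(n):
    """n separate calls to random.uniform()."""
    return timeit.timeit(
        lambda: [random.uniform(0.0, 1.0) for _ in range(n)], number=1)


def time_numpy_one_by_one(n):
    """n separate calls to numpy.random.uniform(), one scalar per call."""
    return timeit.timeit(
        lambda: [np.random.uniform(0.0, 1.0) for _ in range(n)], number=1)


def time_numpy_batch(n):
    """A single call that draws all n numbers through the size argument."""
    return timeit.timeit(
        lambda: np.random.uniform(0.0, 1.0, size=n), number=1)


def draw_batch(low, high, n):
    """Draw n floats from [low, high) in one vectorised call."""
    return np.random.uniform(low, high, size=n)
```

Printing the three timings for something like n = 100000 makes the difference between calling numpy once per number and once per batch visible.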

A problem about using DataFrame in Apache Spark

Here is the code for loading a CSV file (table employee) into a DataFrame in Apache Spark:

But after I ran the jar in Spark, it reported:

It seems the data hadn’t been loaded correctly.
After carefully reviewing the documentation for the CSV format, I noticed that the quote character in my CSV file is a single quote (') instead of a double quote ("). So I added an option in my code to let Spark recognise the single quote:

This time the CSV was read properly.
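
The effect of the quote character is easy to reproduce with Python’s own csv module, which also defaults to the double quote. This is a sketch of the principle, not the Spark code from this post, and the sample row is invented; in Spark’s CSV reader the corresponding setting is, as far as I know, the quote option:

```python
import csv
import io

# Invented sample: fields are wrapped in single quotes and contain commas.
SAMPLE = "id,name,address\n1,'Smith, John','12 Main St, Springfield'\n"


def parse(text, quotechar):
    """Parse CSV text with the given quote character."""
    return list(csv.reader(io.StringIO(text), quotechar=quotechar))


# Default double quote: the single quotes are treated as ordinary characters,
# so the commas inside them split the fields.
broken_rows = parse(SAMPLE, '"')

# Single quote as the quote character: the fields come out whole.
fixed_rows = parse(SAMPLE, "'")
```

Here broken_rows[1] has five fields while fixed_rows[1] has the expected three, which is the same kind of misalignment that showed up in the DataFrame.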

A convenient environment for writing LaTeX

More than a year ago, I wrote a paper about how to accelerate Deep Learning training for sparse features and dense features (images). To write this paper, I installed a bunch of tools and plugins on my MacBook and fixed a lot of their errors by searching Google. Preparing a LaTeX environment on a local computer is really a pain in the neck.
Fortunately, I found a convenient way today.
First, download your favourite template. For me the best one is CVPR-2020, whose template anyone can download. The template is a zip file.
Second, go to overleaf.com and sign up for a new account. Then, at the top left of the page, click “New Project”, then “Upload Project”, and choose the zip file above.
Third, you will now see a beautiful IDE for writing LaTeX.




Enjoy!