How to ignore illegal samples of a dataset in PyTorch?

I have implemented a dataset class for my image samples, but it can't handle the situation where a corrupted image is read:
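A minimal sketch of such a dataset (the original code was not preserved; class and field names here are my own):

```python
from PIL import Image
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # A corrupted file makes Image.open()/convert() raise an OSError,
        # which crashes the whole DataLoader worker
        return Image.open(self.paths[index]).convert("RGB")
```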

The correct solution is on the PyTorch Forum. Therefore I changed my code:
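Roughly, the idea from the forum is to return None for a corrupted sample and filter the Nones out in a custom collate_fn. A sketch under that assumption:

```python
from PIL import Image
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate

class ImageDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        try:
            return Image.open(self.paths[index]).convert("RGB")
        except OSError:
            # Mark the corrupted sample instead of crashing the worker
            return None

def my_collate(batch):
    # Drop the None entries before batching
    batch = filter(lambda x: x is not None, batch)
    return default_collate(batch)

# DataLoader(dataset, batch_size=32, collate_fn=my_collate)
```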

But it reports:

It seems default_collate() can't recognize the 'filter' object. Don't worry, we just need to add one small function call: list().
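The working collate function then becomes (same sketch as above, with the one-line fix):

```python
def my_collate(batch):
    # list() materializes the filter object so default_collate can batch it
    batch = list(filter(lambda x: x is not None, batch))
    return default_collate(batch)
```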

Tips about pytest


1. Error: "fixture 'mocker' not found"
After running pytest, it reported:

The solution is just to install the missing pip package:
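The mocker fixture is provided by the pytest-mock plugin:

```
pip install pytest-mock
```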

2. How to make sure a function has been called without caring about its arguments?
There are two methods. The first method is using ".called":
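For example (mymodule and its run()/save() functions are hypothetical names, not from the original post):

```python
import mymodule

def test_run_calls_save(mocker):
    # Replace mymodule.save with a mock object
    mock_save = mocker.patch("mymodule.save")
    mymodule.run()
    # .called is True if the mock was invoked, no matter with which arguments
    assert mock_save.called
```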

The second method is using "mocker.spy()":
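A sketch with the same hypothetical names. Unlike patch(), spy() lets the real function run while recording its calls:

```python
import mymodule

def test_run_calls_save(mocker):
    spy = mocker.spy(mymodule, "save")
    mymodule.run()
    # The real save() was executed, and the spy recorded the call
    assert spy.call_count == 1
```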

Books I read in year 2019


At the beginning of 2019, I finished the book "The Great Siege: Malta 1565". The story of a few loyal knights protecting Europe from the Ottoman Empire is so extraordinary that it encouraged me to carry on learning and working in information technology.
To find a new job as a Data Engineer or Data Scientist, I practically memorized the whole book "Hundreds of Interviews about Machine Learning" (title translated from Chinese). Although I haven't found a machine learning job (actually, it's a job about just damned PHP and JavaScript), this book gave me confidence and direction before looking for a new one.
I bought the book "Rats of NIMH" at the end of 2016 and finished reading it after more than two years. In that period, life changed tremendously for me, though I hope the end of my story will be as good as the Frisby family's.

The most exciting new thing I learned was NLP in deep learning. After reading the papers on Word2Vec, Transformer, ELMo, BERT, etc., I became very familiar with, and very interested in, NLP.
After starting my new job in June 2019, I read the book "Statistical Machine Learning" (title translated from Chinese) on the commute bus. The bus vibrated a lot, so I had to read for a while, rest my eyes, and repeat. Life is not easy, so I should persist.

The generating speed of random numbers in Python 3

I just want to generate a random number in a range (float or integer) using Python. Since my code only needs one random number at a time, the speed of a single call to the generating function is critical.

So let’s do the experiment:
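Something along these lines (a sketch of the benchmark, since the original snippet was not preserved; absolute numbers will vary by machine):

```python
import timeit

# One number per call, one million calls each
print("random.uniform:",
      timeit.timeit("random.uniform(0, 100)",
                    setup="import random", number=1_000_000))
print("numpy.random.uniform:",
      timeit.timeit("numpy.random.uniform(0, 100)",
                    setup="import numpy", number=1_000_000))
```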

The result is:

It looks like random.uniform() from the Python 3 standard library is the fastest one. But there is still an odd phenomenon: numpy is not as fast as we expected.
Actually, the correct way to use numpy.random.uniform() is to set its size argument.
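For example, assuming we want a million floats in [0, 100):

```python
import numpy

# One call produces the whole batch, so the per-number cost is far lower
# than calling numpy.random.uniform() a million times
samples = numpy.random.uniform(0, 100, size=1_000_000)
print(samples.shape)  # (1000000,)
```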

The result is:

Thus the best way to generate a bunch of random numbers at a time is numpy.random.uniform() with its size argument set.

A problem about using DataFrame in Apache Spark

Here is the code for loading a CSV file (table employee) into an Apache Spark DataFrame:
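The original code was a compiled Spark job; here is a PySpark sketch of the same load (the file path and header option are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("employee").getOrCreate()

# Read the employee table from CSV with Spark's default CSV options
df = spark.read.option("header", "true").csv("/data/employee.csv")
df.show()
```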

But after I ran the jar in Spark, it reported:

It seems the data hadn't been loaded correctly.
After reviewing the documentation for the CSV format carefully, I noticed that the quote character in my CSV file is ' instead of ". So I added an option in my code to let Spark recognise the single quote:
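"quote" is a real option of Spark's CSV reader; in the sketch above, the fix looks like this:

```python
# Spark's CSV reader assumes double quotes by default;
# override the quote character so single-quoted fields parse correctly
df = (spark.read
      .option("header", "true")
      .option("quote", "'")
      .csv("/data/employee.csv"))
```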

This time the CSV was read out properly.

A convenient environment to write LaTeX

More than one year ago, I wrote a paper about how to accelerate deep learning training for sparse features and dense features (images). To write this paper, I installed a bunch of tools and plugins on my MacBook and fixed a lot of their errors by searching Google. It seems preparing a LaTeX environment on a local computer is really a pain in the neck.
Fortunately I found a convenient way today.
First, download your favourite template. For me, the best one is the CVPR 2020 template, which anyone can download. The template is a zip file.
Second, go to overleaf.com and sign up for a new account. Then, in the top-left of the page, click "New Project", then "Upload Project", and choose the zip file from above.
Third, you will now see a beautiful IDE for writing LaTeX.

Enjoy!

Using Single Shot Detection to detect birds (Episode four)

In the previous article, I reached mAP 0.770 on the VOC2007 test set.
Four months have passed. After trying a lot of interesting ideas from different papers, such as FPN, CELU, and RFBNet, I finally realised that the data is more important than the network structure. So I used COCO2017+VOC instead of only VOC to train my model. The mAP on the VOC2007 test set eventually reached 0.797.
But another strange thing happened: a big bounding box appeared around the whole image for the 16-birds image. Even after using dropout and changing augmentation policies, the strange big box still existed.
I suspected that the bird images in COCO2017 are not general enough, so I decided to use a more abundant dataset: Open Images Dataset V5. After retrieving all bird images from it, I got 18,525 images with corresponding annotations. Using them for training, I finally got a more promising bird-detection result on that 16-birds image (using threshold 0.65):

It seems the bird images in Open Images Dataset V5 are more general than those in COCO2017. But under COCO evaluation, the mAP of the model trained on Open Images is smaller than that of the model trained on COCO2017. So it looks like I need a more comprehensive evaluation metric now.

The MySQL master-slave drift problem in AWS

About one month ago, we met a problem with a MySQL master-slave architecture on AWS EC2. The MySQL master ran very fast, but the slave could only serve data from about two or three hours earlier.
We first suspected that the master or slave instance didn't have enough resources, so we upgraded the instance type to give them more CPU cores and memory. But the lag problem still existed.
Only after we set binlog_group_commit_sync_delay = 10000 did the drift disappear.
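binlog_group_commit_sync_delay is a dynamic global variable in MySQL, so it can be changed at runtime; 10000 is the microsecond value from our case:

```
SET GLOBAL binlog_group_commit_sync_delay = 10000;
```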
Let’s see the description for binlog_group_commit_sync_delay:

binlog_group_commit_sync_delay: Controls how many microseconds the binary log commit waits before synchronizing the binary log file to disk. By default, binlog_group_commit_sync_delay is set to 0, meaning that there is no delay. Setting binlog_group_commit_sync_delay to a microsecond delay enables more transactions to be synchronized together to disk at once, reducing the overall time to commit a group of transactions because the larger groups require fewer time units per group.

An example of using Spark Structured Streaming

This snippet monitors two directories and joins their data whenever a new CSV file appears in either directory.
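The original snippet was not preserved; here is a PySpark sketch of the described setup (directory paths, schemas, and the join key are all assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("stream-join").getOrCreate()

# Streaming file sources require an explicit schema
schema_a = StructType([StructField("id", IntegerType()),
                       StructField("name", StringType())])
schema_b = StructType([StructField("id", IntegerType()),
                       StructField("dept", StringType())])

# Each readStream watches one directory for newly arriving CSV files
spark.readStream.schema(schema_a).csv("/data/dir_a").createOrReplaceTempView("a")
spark.readStream.schema(schema_b).csv("/data/dir_b").createOrReplaceTempView("b")

# The join itself is plain Spark SQL over the two streaming views
joined = spark.sql("SELECT a.id, a.name, b.dept FROM a JOIN b ON a.id = b.id")

query = (joined.writeStream
         .outputMode("append")
         .format("console")
         .option("checkpointLocation", "/tmp/stream_join_checkpoint")
         .start())
query.awaitTermination()
```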

The join operation is implemented in Spark SQL, which is easy to use (especially for a DBA) and also easy to maintain.

Some articles said that if the Spark process restarts after a failure, the 'checkpoint' would help it continue from the last uncompleted position. I tried it on my local computer and noticed that it does produce some duplicated rows after a restart. This is a severe problem for a production environment, so I will check it in my next tests.

A problem with using PySpark SQL

Here is the code:
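Reconstructed from the description below (the schema and column names are my guesses; spark is predefined when the script is piped into bin/pyspark):

```python
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([StructField(name, IntegerType())
                     for name in ["a", "b", "c", "d"]])
row = ['2', 29, 29, 29]

# Broken: createDataFrame() expects a collection of rows,
# but 'row' here is a single record
df = spark.createDataFrame(row, schema)
df.show()
```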

It reports an error after running 'cat xxx.py | bin/pyspark':

At first I thought it was because '2' is a string, so I changed 'row' to [2, 29, 29, 29]. But the error just changed to:

Then I searched on Google and found an article. It looked like I had forgotten to convert my Python 'list' into an 'RDD' of Apache Spark.
But at last, I found the real reason: I just needed to wrap my 'list' in another pair of '[]'!
The right code is here:
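With the same assumed schema as above:

```python
# createDataFrame() takes a list of rows, so wrap the single row in '[]'
row = [2, 29, 29, 29]
df = spark.createDataFrame([row], schema)
df.show()
```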