Robin on Linux – Page 17 – All about technology

Tips about pytest

1. Error for “fixture ‘mocker’ not found”
After running pytest, it reported:

E       fixture 'mocker' not found
>       available fixtures: cache, capfd, capsys, doctest_namespace, mock, mocker, monkeypatch, pytestconfig, record_xml_property, recwarn, request, requests_get, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

The solution is just installing the missing pip package:

pip install pytest-mock

2. How to make sure a function has been called without caring about its arguments?
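There are two methods. The first method is to patch the method and then check “.called” (the patch has to be applied before the call, otherwise the mock never sees it):

    ...
    mocker.patch.object(EmailSender, "_email_spec")
    sender.send_email()
    assert sender._email_spec.called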

The second method is using “mocker.spy()”

    ...
    spy = mocker.spy(sender, "_email_spec")
    sender.send_email()
    spy.assert_called()
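
Both approaches can also be tried without pytest. Below is a minimal sketch using only `unittest.mock` from the standard library, which pytest-mock wraps. The `EmailSender` class here is a hypothetical stand-in (the real class is not shown in this post), and `wraps=` is used as a rough equivalent of `mocker.spy()`:

```python
from unittest import mock


class EmailSender:
    """Hypothetical stand-in for the class under test."""

    def _email_spec(self, subject):
        return {"subject": subject}

    def send_email(self):
        return self._email_spec("hello")


def check_called_flag():
    sender = EmailSender()
    # Patch first, then exercise the code, then inspect the mock.
    with mock.patch.object(EmailSender, "_email_spec") as patched:
        sender.send_email()
        return patched.called


def check_spy():
    sender = EmailSender()
    # wraps= keeps the real behaviour while recording calls.
    with mock.patch.object(sender, "_email_spec", wraps=sender._email_spec) as spy:
        result = sender.send_email()
        spy.assert_called()  # passes regardless of the arguments used
        return result


print(check_called_flag())
print(check_spy())
```

The first helper replaces the method entirely, so the real code does not run. The second one still runs the real method and only records the call.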

Books I read in year 2019

At the beginning of 2019, I finished the book “The Great Siege: Malta 1565”. The story of a few loyal knights protecting Europe from the Ottoman Empire is so extraordinary that it encouraged me to keep learning and working in information technology.
To find a new job as a Data Engineer or Data Scientist, I almost memorized the whole book “Hundreds of interviews about machine learning” (Title translated from Chinese). Although I haven’t found a job in machine learning (actually, my job is just damned PHP and JavaScript), this book gave me confidence and direction before I started looking for a new job.
I bought the book “Rats of NIMH” at the end of 2016, and finished reading it more than two years later. During that period, life changed tremendously for me, and I hope it ends as well as it did for the Frisby family.
The most exciting new thing I learned about is NLP in deep learning. After reading the papers about Word2Vec, Transformer, ELMo, BERT, etc., I became very familiar with and interested in NLP.
After starting my new job in June 2019, I read the book “Statistical Machine Learning” (Title translated from Chinese) on the commuter bus. The bus shook a lot, so I had to read for a while, rest my eyes, and then repeat. Life is not easy, so I should keep persevering.

The speed of generating random numbers in Python3

I just want to generate a random number in a range (float or integer) using Python. Since my code only needs one random number at a time, the speed of each call to the generating function is critical.
So let’s do the experiment:

import random
import time
import numpy as np
begin = time.time()
for i in range(10000):
    random.uniform(1, 100)
print('time:', time.time() - begin)
begin = time.time()
for i in range(10000):
    random.randrange(1, 100)
print('time:', time.time() - begin)
begin = time.time()
for i in range(10000):
    np.random.uniform(1, 100)
print('time:', time.time() - begin)

The result is:

time: 0.0025768280029296875
time: 0.00877070426940918
time: 0.022496461868286133

It looks like random.uniform() from the Python3 standard library is the fastest one. But there is still an odd phenomenon: numpy is not as fast as we expected.
Actually, the right way to use numpy.random.uniform() is to set its size argument, so that a single call generates all the numbers:

begin = time.time()
for i in range(10000):
    np.random.uniform(1, 100)
print('time:', time.time() - begin)
begin = time.time()
np.random.uniform(1, 100, 10000)
print('time:', time.time() - begin)

The result is:

time: 0.022496461868286133
time: 0.00012969970703125

Thus the best way to generate a bunch of random numbers at once is numpy.random.uniform() with the size argument, while random.uniform() is faster when only one number is needed per call.
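
Timings taken with time.time() over such short loops can be noisy. As a rough sketch, the same per-call comparison can be done with the standard `timeit` module. The helper name `time_call` is my own, and only the two standard-library functions are measured here, so the snippet runs without numpy:

```python
import random
import timeit


def time_call(func, number=10000):
    """Return the total seconds spent on `number` calls of func."""
    return timeit.timeit(func, number=number)


timings = {
    "random.uniform": time_call(lambda: random.uniform(1, 100)),
    "random.randrange": time_call(lambda: random.randrange(1, 100)),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s for 10000 calls")
```

The absolute numbers depend on the machine and the Python version, so they should only be compared with each other.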

A problem about using DataFrame in Apache Spark

Here is the code for loading a CSV file (the employee table) into an Apache Spark DataFrame:

    val schema = StructType(
      Seq(
        StructField("id", LongType),
        StructField("birthday", DateType),
        StructField("firstname", StringType),
        StructField("lastname", StringType),
        StructField("gender", StringType),
        StructField("workingdate", DateType)
      )
    )
    val df = ss.read.format("csv")
      .option("header", "false")
      .schema(schema)
      .load("employees.csv")
    df.show()

But after I ran the jar in Spark, it reported:

+----+--------+---------+--------+------+-----------+
|  id|birthday|firstname|lastname|gender|workingdate|
+----+--------+---------+--------+------+-----------+
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
|null|    null|     null|    null|  null|       null|
+----+--------+---------+--------+------+-----------+

It seems the data hasn’t been loaded correctly.
After carefully reviewing the documentation for the CSV format, I noticed that the quote character in my CSV file is a single quote (') instead of a double quote ("). So I added an option to my code to let Spark recognise the single quote:

    val df = ss.read.format("csv")
      .option("header", "false")
      .option("quote", "'")
      .schema(schema)
      .load("employees.csv")

This time the CSV was read properly.

+-----+----------+---------+-----------+------+-----------+
|   id|  birthday|firstname|   lastname|gender|workingdate|
+-----+----------+---------+-----------+------+-----------+
|10001|1953-09-02|   Georgi|    Facello|     M| 1986-06-26|
|10002|1964-06-02|  Bezalel|     Simmel|     F| 1985-11-21|
|10003|1959-12-03|    Parto|    Bamford|     M| 1986-08-28|
|10004|1954-05-01|Chirstian|    Koblick|     M| 1986-12-01|
|10005|1955-01-21|  Kyoichi|   Maliniak|     M| 1989-09-12|
|10006|1953-04-20|   Anneke|    Preusig|     F| 1989-06-02|
|10007|1957-05-23|  Tzvetan|  Zielinski|     F| 1989-02-10|
|10008|1958-02-19|   Saniya|   Kalloufi|     M| 1994-09-15|
|10009|1952-04-19|   Sumant|       Peac|     F| 1985-02-18|
|10010|1963-06-01|Duangkaew|   Piveteau|     F| 1989-08-24|
|10011|1953-11-07|     Mary|      Sluis|     F| 1990-01-22|
|10012|1960-10-04| Patricio|  Bridgland|     M| 1992-12-18|
|10013|1963-06-07|Eberhardt|     Terkki|     M| 1985-10-20|
|10014|1956-02-12|    Berni|      Genin|     M| 1987-03-11|
|10015|1959-08-19| Guoxiang|  Nooteboom|     M| 1987-07-02|
|10016|1961-05-02| Kazuhito|Cappelletti|     M| 1995-01-27|
|10017|1958-07-06|Cristinel|  Bouloucos|     F| 1993-08-03|
|10018|1954-06-19| Kazuhide|       Peha|     F| 1987-04-03|
|10019|1953-01-23|  Lillian|    Haddadi|     M| 1999-04-30|
|10020|1952-12-24|   Mayuko|    Warwick|     M| 1991-01-26|
+-----+----------+---------+-----------+------+-----------+
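
To illustrate why the quote character matters, here is a small sketch using Python’s standard `csv` module instead of Spark. The sample line is an assumption about how a row of employees.csv looks (the file itself is not shown in this post). With the default double-quote setting, the single quotes stay inside the field values, so a value like a date can no longer be parsed:

```python
import csv
import io
from datetime import date

# Assumed layout of one row of employees.csv (not taken from the real file).
sample = "10001,'1953-09-02','Georgi','Facello','M','1986-06-26'\n"


def parse(line, quotechar):
    """Parse a single CSV line with the given quote character."""
    return next(csv.reader(io.StringIO(line), quotechar=quotechar))


default_fields = parse(sample, '"')  # the quotes remain part of each value
single_fields = parse(sample, "'")   # the quotes are stripped

print(default_fields[1])
print(single_fields[1])
print(date.fromisoformat(single_fields[1]))
```

How Spark itself reacts to the unparsable values may depend on its version and parse mode. The sketch only shows what the parser sees in each case.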

A convenient environment to write LaTeX

More than one year ago, I wrote a paper about how to accelerate Deep Learning training for sparse features and dense features (images). To write this paper, I installed a bunch of tools and plugins on my MacBook and fixed a lot of their errors by searching Google. Preparing a LaTeX environment on a local computer is really a pain in the neck.
Fortunately I found a convenient way today.
First, download your favourite template. For me the best one is CVPR-2020, whose template anyone can download. The template is a zip file.
Second, go to overleaf.com and sign up for a new account. Then, in the top-left of the page, click “New Project”, click “Upload Project”, and choose the zip file above.
Third, you will now see a beautiful IDE for writing LaTeX.

Enjoy!

Using Single Shot Detection to detect birds (Episode four)

In the previous article, I reached mAP 0.770 on the VOC2007 test set.
Four months have passed. After trying a lot of interesting ideas from different papers, such as FPN, CELU, and RFBNet, I finally realised that the data is more important than the network structure. So I used COCO2017+VOC instead of only VOC to train my model. The mAP on the VOC2007 test set eventually reached 0.797.
But another strange thing happened: a big bounding box appeared around the whole image for the 16-birds-image. Even after I used dropout and changed the augmentation policies, the strange big box still existed.
I suspected that the COCO2017 dataset is not general enough for birds. Therefore I decided to use a more abundant dataset: Open Images Dataset V5. After retrieving all bird images from Open Images Dataset V5, I got 18525 images with corresponding annotations. By training on them, I finally got a more promising bird detection result for that 16-birds-image (using threshold 0.65).

It seems the bird images in Open Images Dataset V5 are more general than those in COCO2017. But the model trained on Open Images gets a lower mAP in the COCO evaluation than the model trained on COCO2017. So it looks like I need a more comprehensive evaluation metric now.

The MySQL master-slave drift problem in AWS

About one month ago, we met a problem with a MySQL master-slave architecture on AWS EC2. The MySQL master ran very fast, but the slave lagged behind: it only had data from about two or three hours earlier.
We first suspected that the master or slave instance did not have enough resources, so we upgraded the instance type to give them more CPU cores and memory. But the lag remained.
Only after we set binlog_group_commit_sync_delay=10000 (10000 microseconds, i.e. 10 milliseconds) did the drift disappear.
Let’s see the description of binlog_group_commit_sync_delay:

`binlog_group_commit_sync_delay` Controls how many microseconds the binary log commit waits before synchronizing the binary log file to disk. By default binlog_group_commit_sync_delay is set to 0, meaning that there is no delay. Setting binlog_group_commit_sync_delay to a microsecond delay enables more transactions to be synchronized together to disk at once, reducing the overall time to commit a group of transactions because the larger groups require fewer time units per group.

An example of using Spark Structured Streaming

This snippet will monitor two directories and join the data from them when there is a new CSV file in any directory.

from pyspark.sql import SQLContext
from pyspark.context import SparkContext
from pyspark.sql.types import *
sc = SparkContext()
sqlContext = SQLContext.getOrCreate(sc)
pl_schema = StructType([StructField('id', LongType(), True),
    StructField('gid', LongType(), True),
    StructField('pid', LongType(), True),
    StructField('firstlogin', IntegerType(), True)
])
pl_df = sqlContext.readStream.schema(pl_schema).csv('/tmp/pl/')
pl_df.createOrReplaceTempView('pl_mapping')
user_schema = StructType([StructField('id', LongType(), True),
    StructField('fullname', StringType(), True)
])
user_df = sqlContext.readStream.schema(user_schema).csv('/tmp/user/')
user_df.createOrReplaceTempView('users')
# Join on pl.gid: the pl_mapping schema above defines no 'gaf_id' column
result = sqlContext.sql("SELECT u.id, u.fullname FROM users AS u JOIN pl_mapping AS pl ON u.id = pl.gid")
query = result.writeStream.outputMode('append').format('csv').option('path', '/tmp/result/').option('checkpointLocation', '/tmp/ckpt/').start()
print('Starting')
query.awaitTermination(3600)

The join operation is implemented with Spark SQL, which is easy to use (for DBAs) and also easy to maintain.
Some articles say that if the Spark process restarts after a failure, the ‘checkpoint’ helps it continue from the last uncompleted position. I tried it on my local computer and noticed that it does produce some duplicated rows after a restart. This is a severe problem for a production environment, so I will look into it in later tests.

A problem of using Pyspark SQL

Here is the code:

from pyspark.sql import SQLContext
from pyspark.context import SparkContext
from pyspark.sql.types import *
from typing import List
sc = SparkContext()
sqlContext = SQLContext.getOrCreate(sc)
schema = StructType([StructField('id', LongType(), True),
                      StructField('gid', LongType(), True),
                      StructField('pid', LongType(), True),
                      StructField('firstlogin', IntegerType(), True)
])
row = ['2', '29', '29', '29']
df = sqlContext.createDataFrame(row, schema)
df.show()

It reports an error when run with ‘cat xxx.py|bin/pyspark’:

TypeError: StructType can not accept object '2' in type <class 'str'>

At first I thought it was because ‘2’ is a string, so I changed ‘row’ to ‘[2, 29, 29, 29]’. But the error just changed to:

TypeError: StructType can not accept object 2 in type <class 'int'>

Then I searched on Google and found this article. It looked like I had forgotten to convert the Python ‘list’ to an Apache Spark ‘RDD’.
But at last, I found the real reason: createDataFrame() expects a list of rows, so I just needed to wrap my ‘list’ in another ‘[]’!
The right code is here (with integer values, to match the LongType and IntegerType fields):

row = [2, 29, 29, 29]
df = sqlContext.createDataFrame([row], schema)
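
The reason can be sketched in plain Python, without Spark. The helper `rows_to_records` below is hypothetical and only mimics the idea that every element of the outer list is treated as one row. It is not how PySpark is implemented:

```python
FIELD_NAMES = ["id", "gid", "pid", "firstlogin"]


def rows_to_records(data):
    """Hypothetical helper: treat each element of `data` as one row."""
    records = []
    for row in data:
        if not isinstance(row, (list, tuple)):
            raise TypeError(f"expected a row (list or tuple), got {row!r}")
        if len(row) != len(FIELD_NAMES):
            raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(row)}")
        records.append(dict(zip(FIELD_NAMES, row)))
    return records


row = [2, 29, 29, 29]

try:
    rows_to_records(row)  # flat list: each integer is taken as a whole row
except TypeError as exc:
    print("flat list fails:", exc)

print(rows_to_records([row]))  # one row wrapped in an outer list
```

With the flat list, the first "row" is the bare value 2, which matches the object named in the error message.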

Some problems about using AWS DMS

AWS DMS is a new type of service for migrating data between different types of databases and data warehouses. I met some problems when trying to use it in a production environment.
Problem 1. When I used a MySQL server on AWS RDS as the source of a replication task, it reported errors after the task started:

Last failure message
Last Error Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2673] [1020418] Error Code [10001] : Binary Logging must be enabled for MySQL server; Errors in MySQL server binary logging configuration. Follow all prerequisites for 'MySQL as a source in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html or'MySQL as a target in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.MySQL.html ; Failed while preparing stream component 'st_0_WBK5KGUWQAH6VKEP4I5LH2EFHE'.; Cannot initialize subtask; Stream component 'st_0_WBK5KGUWQAH6VKEP4I5LH2EFHE' terminated [reptask/replicationtask.c:2680] [1020418] Stop Reason FATAL_ERROR Error Level FATAL

The failure message looks terrible, but at least I could find this doc to follow. After I changed the configuration as below:

binlog_format	ROW
binlog_checksum	NONE
binlog_row_image	FULL

the error persisted.
The real answer is here, since I used RDS instead of self-managed MySQL. After I added one line of Terraform code to enable “automatic backups”:

resource "aws_db_instance" "test_gaf" {
  ......
  backup_retention_period     = 10
}

the replication task began to work without the error.
Problem 2. After the replication task had run for a while exporting data from MySQL to AWS Redshift, a new error appeared in the Redshift load logs:

019-10-29T04:41:27 [TARGET_LOAD ]E: RetCode: SQL_ERROR SqlState: XX000 NativeError: 30 Message: [Amazon][Amazon Redshift] (30) Error occurred while trying to execute a query: [SQLState XX000] ERROR: User arn:aws:redshift:us-east-1:262284277472:dbuser:analytics-20190902/masteruser is not authorized to assume IAM Role arn:aws:iam::262284277472:role/dms-access-for-endpoint DETAIL: ----------------------------------------------- error: User arn:aws:redshift:us-east-1:262284277472:dbuser:analytics-20190902/masteruser is not authorized to assume IAM Role arn:aws:iam::262284277472:role/dms-access-for-endpoint code: 8001 context: IAM Role=arn:aws:iam::262284277472:role/dms-access-for-endpoint query: 1799 location: xen_aws_credentials_mgr.cpp:321 process: padbmaster [pid=21755] ----------------------------------------------- [1022502] (ar_odbc_stmt.c:4622)

Why is masteruser not authorized? The answer is here. Below is the Terraform code:

data "aws_iam_policy_document" "dms_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      identifiers = ["dms.amazonaws.com"]
      type        = "Service"
    }
  }
  statement {
    actions = ["sts:AssumeRole"]
    # By https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Security.APIRole.html,
    # we also need principal `redshift.amazonaws.com`
    principals {
      identifiers = ["redshift.amazonaws.com"]
      type        = "Service"
    }
  }
}

With this change, “dms_assume_role” has two trusted entities.
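
For reference, the Terraform document above should render to an IAM trust policy with one statement per service principal. The sketch below builds that JSON in Python so the structure is easy to see. The function name `trust_policy` is my own and it is not part of any AWS SDK:

```python
import json


def trust_policy(services):
    """Build a trust policy that lets each service principal assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Principal": {"Service": service},
            }
            for service in services
        ],
    }


policy = trust_policy(["dms.amazonaws.com", "redshift.amazonaws.com"])
print(json.dumps(policy, indent=2))
```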

Problem 3. There was still an error in the Redshift load log (so many errors in AWS DMS…):

Error	Type	Raw Field Value
Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SS]	timestamp	0000-00-00 00:00:00

It seems the answer is here. Therefore I added “acceptanydate=true;timeformat=auto” to the “extra connection settings” of the Redshift endpoint. But the error just changed to:

Error	Type	Raw Field Value
Invalid data	timestamp	0000-00-00 00:00:00

After searching for almost two days, I found that the reason is in the Redshift schema, which is created automatically by the AWS DMS replication task.

CREATE TABLE my (
    ...
    mydate TIMESTAMP DEFAULT '0000-00-00 00:00:00' NOT NULL,
    ...
)

Since the schema doesn’t allow the “mydate” column to be null while “acceptanydate=true” tries to convert “0000-00-00 00:00:00” to null, the final error in Redshift is “Invalid data”.
The solution to this problem is to create the Redshift table manually so that the “mydate” column is nullable, and to change the working mode of the replication task to “TRUNCATE_BEFORE_LOAD”.
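
The conflict can be reproduced in miniature with SQLite from the Python standard library. This is only an analogy, not Redshift: SQLite stands in for the target database, and the helper `to_null_if_zero` is a hypothetical imitation of the zero-date-to-null conversion described above:

```python
import sqlite3

ZERO_DATE = "0000-00-00 00:00:00"


def to_null_if_zero(value):
    """Imitate the described conversion: a zero date becomes NULL (None)."""
    return None if value == ZERO_DATE else value


def try_load(column_ddl):
    """Create table `my` with the given column definition and load one zero date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE my (mydate {column_ddl})")
    try:
        conn.execute("INSERT INTO my (mydate) VALUES (?)", (to_null_if_zero(ZERO_DATE),))
        return "loaded"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"
    finally:
        conn.close()


print(try_load("TIMESTAMP NOT NULL"))  # the converted NULL violates NOT NULL
print(try_load("TIMESTAMP"))           # a nullable column accepts it
```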