The hard way of implementing SSDLite by myself

SSDLite is a variant of the Single Shot MultiBox Detector (SSD). It uses MobileNetV2 instead of VGG as its backbone, which makes detection extremely fast. I tried to implement SSDLite on top of the code base of ssd.pytorch. Although it was not easy work, I learned a lot from the entire process.
First, I simply replaced VGG with MobileNetV2 in the code. However, the loss stopped decreasing after a while of training. Without knowing the reason, I had to compare my code with another open-source project, ssds.pytorch, to find the cause.
Very soon I noticed that, unlike the VGG backbone, which builds its detection head on a 38×38 feature map, MobileNetV2 uses a 19×19 feature map for its first detection layer.

“For MobileNetV1, we follow the setup in [33]. For MobileNetV2, the first layer of SSDLite is attached to the expansion of layer 15 (with output stride of 16).”

From: MobileNetV2: Inverted Residuals and Linear Bottlenecks

After changing my code to follow the paper's description, the loss still would not decrease during training.

In the next three weeks, I tried a lot of things: changing the aspect ratios, replacing SGD with SGDR, changing the number of default boxes, even modifying the network structure to be identical to ssds.pytorch. None of them solved the problem. There was another weird phenomenon: when I ran prediction with my model, it usually gave random detection output.

Not until last week did I notice that my model file was about 10 MB while the ssds.pytorch one was 18 MB. Why would the sizes differ if the models were exactly the same? Following this clue, I eventually found the cause: a large part of my model was never being back-propagated at all!
My old code only implemented the forward() of MobileNetV2, which is not enough: the extra layers were never registered as part of the model. Therefore I used nn.ModuleList() to build the model from a list of layers.
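A minimal sketch of the idea (the class and the layer list here are simplified stand-ins, not the real patch):

    import torch.nn as nn

    class SSDLite(nn.Module):
        def __init__(self, layers):
            super().__init__()
            # self.layers = layers            # plain Python list: the weights are NOT
            #                                 # registered, so they are never trained or saved
            self.layers = nn.ModuleList(layers)  # registered: trained and stored in state_dict()

        def forward(self, x):
            for layer in self.layers:
                x = layer(x)
            return x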

Only nn.ModuleList() pulls all the layers into the back-propagation process and keeps their weights as part of the model. Otherwise the weights stay randomly initialized and are only used for forwarding, which is why I was getting random output before.

I think I should be more careful when adding an FPN to my model in the future.

A tip about Terraform


Terraform is an interesting (in my opinion) tool for implementing Infrastructure-as-Code. When I first used it to write a production script yesterday, I ran into an error report:

After a while of searching on Google, I found the cause: Terraform could not find my AWS credentials on my computer.
Actually I do have a '~/.aws/credentials' file, and the 'aws_access_key_id' and 'aws_secret_access_key' entries are already in it.
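A typical credentials file with only a default profile looks like this (the values are placeholders):

    [default]
    aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
    aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx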

So why can’t Terraform get the credential? The reason is in the ‘provider’ section:

I set the ‘profile’ to ‘analytics’ at first, so the Terraform tried to find something looks like ‘[analytics]’ in ‘~/.aws/credentials’ file, and it failed. The correct way is just set ‘profile’ in ‘provider’ section to ‘default’.

Some ideas about building streaming ETL on AWS

After discussing with technical support engineers from AWS, I got more information about how to use AWS services to build a streaming ETL architecture, step by step.
The main architecture can be described by the diagram below:


[Diagram: streaming ETL architecture on AWS]

AWS S3 is the de facto data lake. All data, whether from AWS RDS, AWS DynamoDB, or other custom sources, can be written into AWS S3 in a columnar format such as Apache Parquet or Apache ORC (CSV is not recommended because it is poorly suited to data scanning and compression). Then data engineers can use AWS Glue to extract the data from AWS S3, transform it (using PySpark or something similar), and load it into AWS Redshift.
Frequently used data can also be kept in AWS Redshift for optimized queries. When tables from both AWS S3 and AWS Redshift need to be joined, AWS Redshift Spectrum can be used.
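As a rough illustration of the extract/transform/load step (my own sketch, not AWS code; the bucket names and columns are made up), a minimal PySpark job could look like this:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Extract: read Parquet files from the S3 data lake (hypothetical bucket)
    events = spark.read.parquet("s3://example-data-lake/events/")

    # Transform: keep completed orders and aggregate them per day
    daily = (events
             .filter(F.col("status") == "completed")
             .groupBy(F.to_date("created_at").alias("day"))
             .agg(F.count("*").alias("orders")))

    # Load: write the result to a staging prefix; Redshift can then ingest it
    # with a COPY command, or it can be queried in place via Redshift Spectrum.
    daily.write.mode("overwrite").parquet("s3://example-warehouse-stage/daily_orders/")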

By the way, I also joined a workshop about Databricks' new unified data analytics and machine learning platform, which is built on AWS. It contains:

1. Delta Lake for data storage and schema enforcement.
2. Notebooks that let users write and run code directly to process and analyze data with Apache Spark, much like Jupyter Notebook.
3. MLflow, which uses the above data to train machine learning models.

I used Apache Spark for learning about four years ago. Back then I even had to build Java/Scala packages by myself, upload them, and run them. Debugging was tedious because I could only scan the CLI logs again and again to find mistakes in the code. Now Databricks gives data scientists and developers a much more convenient solution.


[Image: the Databricks platform]

Anyone who is interested in this platform can try its free edition at https://databricks.com/try-databricks.

The weird comparison behaviour of Python strings

A part of my code did not work as expected: a branch that compared mapping['colour'] with 'Red' never ran, so nothing was printed at all. So I printed out the actual value of mapping['colour'] directly.

Why ‘Red’ is not ‘Red’? After changed the judgement from ‘is’ to ‘==’, the result became correct.
The key is that UNICODE string and normal string is different in Python:
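A sketch of the same effect (the string below is built at run time so it is a distinct object; 'is' checks object identity, not value equality):

    red_literal = 'Red'
    colour = ''.join(['R', 'e', 'd'])     # built at run time: a distinct object
    mapping = {'colour': colour}

    print(mapping['colour'] is red_literal)   # False: different objects
    print(mapping['colour'] == red_literal)   # True: equal values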

It seems we should use '==' to compare two strings instead of 'is'.

Investigating about Streaming ETL solutions

Normal ETL solutions need to deliver all the data from transactional databases to the data warehouse. For instance, DBAs or data scientists usually deploy a script that exports whole tables from the database to the data warehouse every hour. To accelerate this process, we decided to use a streaming ETL solution on AWS (or GCP, if possible).

First, I tested AWS Data Pipeline. Although it is called 'Pipeline', it needs a last-modified column in the customer's MySQL table so that it can decide which part of the table should be extracted on each run: only rows whose last-modified values have changed are extracted. However, our MySQL tables do not have this column, and adding it, along with the corresponding logic in code, would be too tedious for an old infrastructure. So AWS Data Pipeline is not a suitable solution for us.

Then I found a tutorial, and my colleague found another document at the same time. Combining these two suggestions, I worked out a viable solution:

  1. An in-house service that uses pymysqlreplication and boto3 to parse the binlog from MySQL and write the parsed-out events into AWS Kinesis (or Kafka), sketched below
  2. Another in-house service that reads these events and exports them into AWS Redshift
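A rough sketch of the first service (the stream name, server id, and MySQL credentials below are placeholders, not production code):

    import json
    import boto3
    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import (
        WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent)

    MYSQL = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}
    kinesis = boto3.client("kinesis")

    # Read row-level events from the MySQL binlog and forward them to Kinesis.
    stream = BinLogStreamReader(
        connection_settings=MYSQL,
        server_id=100,                    # must be unique among replication clients
        only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
        blocking=True,
        resume_stream=True)

    for event in stream:
        for row in event.rows:
            record = {"schema": event.schema, "table": event.table,
                      "type": type(event).__name__, "row": row}
            kinesis.put_record(
                StreamName="mysql-binlog-events",   # hypothetical stream name
                Data=json.dumps(record, default=str),
                PartitionKey=event.table)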

Since AWS Redshift is a columnar-storage data warehouse, inserting/updating/deleting rows one by one severely hurts its performance. So we need to use S3 to store intermediate files and the 'COPY' command to batch the operations, as below:


[Diagram: batching loads into AWS Redshift through S3 and the COPY command]
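For example (my own sketch; the cluster, table, bucket, and IAM role names are made up), the loading side can run a single COPY for a whole batch of staged files via psycopg2 instead of issuing row-by-row INSERTs:

    import psycopg2

    conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                            port=5439, dbname="warehouse",
                            user="etl", password="secret")
    with conn, conn.cursor() as cur:
        # Load a whole batch of staged Parquet files from S3 in one statement.
        cur.execute("""
            COPY analytics.events
            FROM 's3://example-etl-stage/events/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            FORMAT AS PARQUET
        """)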

Tips about Numpy and PyTorch

1. Type conversion in NumPy
Here is my code:
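It boiled down to something like this:

    import numpy as np

    c = np.array([])       # an empty Python list...
    print(c.dtype)         # float64: NumPy's default dtype when it cannot infer one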

Guess what? The type of variable 'c' is 'float64'! It seems NumPy treats an empty Python list as 'float64' by default. So the correct code should be:
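My reconstruction of the fix passes the dtype explicitly:

    import numpy as np

    c = np.array([], dtype=np.int64)   # state the dtype explicitly
    print(c.dtype)                     # int64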

This time, the type of 'c' is 'int64'.

2. Convert a tensor of PyTorch to ‘uint8’
If we want to convert a PyTorch tensor to 'float', we can use tensor.float(). If we want to convert it to 'int32', we can use tensor.int().
But if we want to convert it to 'uint8', what should we do? There isn't any function named 'uint8()' on a tensor.
Actually, it is much simpler than I expected:

Using Single Shot Detection to detect birds (Episode three)

In the previous article, I reached mAP 0.740 on the VOC2007 test set. After one month, I found out that the key to boosting object-detection performance is not only a cutting-edge model but also a sophisticated augmentation methodology. Therefore I manually checked every image generated by 'utils/augmentations.py'. Soon, some confusing images came out:






There is a lot of bright noise in these images. The reason is that we only use addition and multiplication to change the contrast/brightness of the images, which can make some pixel values overflow. To prevent it, I used clip() from NumPy:
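The idea, sketched with hypothetical alpha (contrast) and delta (brightness) factors, is just to clamp the result back into range:

    import numpy as np

    def adjust_brightness_contrast(image, alpha, delta):
        # image is a float32 array in the 0-255 range, as in ssd.pytorch
        image = image * alpha + delta        # contrast, then brightness
        return np.clip(image, 0.0, 255.0)    # clamp so no pixel overflows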

Now the images look much more normal:





After this tiny modification, the mean AP jumped from 0.740 to 0.769. This is the power of fine-tuned augmentation!

Afterwards, I went on to change the augmentation function Expand() in 'utils/augmentations.py'. The original code uses a fixed value to build the 'background' for all images. My program instead randomly chooses images from VOC2012 (with the foreground objects cropped out) as the background. It looks like the images below:






This method is borrowed from mixup [1, 2], and by using it the mean AP even reached 0.770.
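A rough sketch of the background replacement (the helper name, the ratio argument, and the use of cv2.resize are illustrative choices, not the exact patch):

    import numpy as np
    import cv2

    def expand_with_background(image, background, ratio):
        # Instead of filling the expanded canvas with a fixed value, resize a
        # randomly chosen VOC2012 image (foreground objects cropped out
        # beforehand) and paste the training image onto it. ratio >= 1.
        h, w, _ = image.shape
        canvas = cv2.resize(background, (int(w * ratio), int(h * ratio)))
        top = np.random.randint(0, canvas.shape[0] - h + 1)
        left = np.random.randint(0, canvas.shape[1] - w + 1)
        canvas[top:top + h, left:left + w] = image
        return canvas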

Some tips for opencv-python


Type conversion
Using opencv-python to draw an object-detection rectangle on an image:
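The basic pattern, with a placeholder file name and box coordinates, looks like this:

    import cv2

    img = cv2.imread('bird.jpg')                       # uint8 BGR image
    cv2.rectangle(img, (50, 50), (200, 200),           # top-left, bottom-right corners
                  color=(0, 255, 0), thickness=2)      # green box
    cv2.imwrite('bird_boxed.jpg', img)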

The result looks like this




But in a more complicated program, the image I was processing came in as float32, so the code looked roughly like this:
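Roughly, with the same placeholders as above:

    import cv2
    import numpy as np

    img = cv2.imread('bird.jpg').astype(np.float32)    # the image is now float32, still 0-255
    cv2.rectangle(img, (50, 50), (200, 200),
                  color=(0, 255, 0), thickness=2)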

But this time, the rectangle disappeared.




The reason is that opencv-python expects the image as a NumPy array of type 'uint8', not 'int'! The correct code should be:
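The fix, reconstructed here, converts the array back to uint8 before drawing:

    import cv2
    import numpy as np

    img = cv2.imread('bird.jpg').astype(np.float32)    # the float32 image from before
    img = img.astype(np.uint8)                         # back to the dtype OpenCV expects
    cv2.rectangle(img, (50, 50), (200, 200),
                  color=(0, 255, 0), thickness=2)
    cv2.imwrite('bird_boxed.jpg', img)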

Check the source of the image

This code snippet reported an error:

It seemed that the argument 'img' was not of the correct type, so I blindly changed the code to convert 'img' to 'UMat'.

That only produced another, more inexplicable error:

After a long time searching, I finally found the cause: the function 'somefunc()' returned a tuple '(img, target)' instead of only 'img'…
I should have looked more closely at the argument 'img' before changing the code.
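For the record, the shape of the bug and its fix (somefunc here is a stand-in that returns dummy data):

    import numpy as np

    def somefunc():
        # stand-in for the real function: it returns (image, target), not just the image
        return np.zeros((300, 300, 3), dtype=np.uint8), {'label': 'bird'}

    img = somefunc()            # wrong: img is actually the whole (img, target) tuple
    img, target = somefunc()    # right: unpack before handing img to OpenCV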

Get the type of engine for a table in MySQL

To see which storage engine a MySQL table uses, we could type:
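Presumably a command along these lines (with placeholder database and table names):

    -- shows the Engine column, along with many other columns
    SHOW TABLE STATUS FROM mydb LIKE 'mytable';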

Although the command is simple, its output is verbose. We can also use a slightly more complicated query to get a brief answer:
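Most likely a query against information_schema, for example:

    SELECT TABLE_NAME, ENGINE
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'mytable';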

Use docker as a normal user


I have used docker for more than four years, although not in a production environment. Not until last week did a colleague tell me that docker can be used by a non-root user.
The documentation is here.
I just need to run:
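Roughly, following the documented post-install steps:

    # create the docker group (it may already exist) and add my user to it
    sudo groupadd docker
    sudo usermod -aG docker $USER
    # then log out and back in (or run `newgrp docker`) for the change to take effect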

So easy.