Some tips about using Google’s TPU (Cont.)

Sometimes I get this error from TPUEstimator:

After stopping and restarting the TPU in the GCP console, the error disappeared. A TPU can’t be used directly the way a GPU can: you won’t see a device in the VM that looks like ‘/dev/tpu’ or anything similar. Google provides the TPU as an RPC service, so you can only run DNN training through this service. I suspect this RPC service is not stable enough, so sometimes it fails and leads to the ‘Deadline Exceeded’ error.

When I get this type of error from TPU:

The only solution is to create a new TPU instance in the GCP console and delete the old one. It seems Google needs to improve the robustness of its TPU RPC service.

Running 10000 steps and getting the ‘loss’ for every run:

It’s quite strange that the ‘loss’ can’t go low enough. I still need to run more experiments.

Previously, I ran MobileNet_v2 on a machine with a GeForce GTX 960, and it could process 100 samples per second. Using a TPUv2-8 (8 TPU cores), it processed about 500 samples per second. At first I was disappointed by the performance boost of TPUv2, since that works out to only about 1.4 TFLOPS per core. But then I noticed that the bottleneck may not be the TPU itself, since IO is usually the limit on training speed. Besides, my model is MobileNet_v2, which is so simple and light that it can’t exploit the full capability of a TPU.
Therefore I set ‘depth_multiplier=4’ for MobileNet_v2. With this model, the GTX 960 could process 21 samples per second, and the TPUv2-8 about 275 samples per second. This time, I estimate each TPUv2 core delivers about 4 TFLOPS. I know this seems far below Google’s official 45 TFLOPS, but considering the possible bottlenecks of storage IO and network bandwidth, it becomes understandable. And there is another possibility: Google’s 45 TFLOPS refers to half-precision performance 🙂
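As a sanity check on that estimate, here is the back-of-the-envelope arithmetic in Python. The GTX 960’s ~2.4 peak FP32 TFLOPS is my own assumption, not a number from the measurements above:

```python
# Rough estimate of per-core TPUv2 throughput, assuming both runs were
# compute-bound so effective FLOPS scales with samples/sec.
# GTX960_TFLOPS is an assumed nominal peak, not a measured value.

GTX960_TFLOPS = 2.4            # assumed peak FP32 throughput of a GTX 960
GTX960_SAMPLES_PER_SEC = 21    # measured with depth_multiplier=4
TPU_SAMPLES_PER_SEC = 275      # measured on a TPUv2-8 (8 cores)
TPU_CORES = 8

speedup = TPU_SAMPLES_PER_SEC / GTX960_SAMPLES_PER_SEC
tflops_per_core = GTX960_TFLOPS * speedup / TPU_CORES

print(round(speedup, 1))          # ~13.1x overall
print(round(tflops_per_core, 1))  # ~3.9 TFLOPS per core
```

Under these assumptions the estimate lands right around 4 TFLOPS per core.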

Google has just released Tensorflow 1.11 for TPU clusters. At first, I thought I could use hooks in TPUEstimatorSpec now, but after adding

it reports

Clearly, the TPU is much harder to use and debug than a GPU/CPU.

Some tips about using Google’s TPU

About one month ago, I submitted a request to Google Research Cloud to use TPUs for free. Fortunately, I received the approval yesterday. It lets me use 5 regular Cloud TPUs and 100 preemptible Cloud TPUs for free for 30 days, with nothing more than my GCP project name submitted.
Then I had to change my previous Tensorflow program to let it run on TPUs. I couldn’t just change tf.device(‘/gpu:0’) to tf.device(‘/tpu:0’) in the code to run training on a Google TPU. Actually, there are many documents about how to modify code for this, such as TPUEstimator, Using TPUs, etc.

Here are some tips about porting code for TPUs:

1. We can only use TPUEstimator for training

Pay attention to the ‘batch_axis’ argument. It tells the TPU pod to split the data and labels along dimension 0, since I use the ‘NHWC’ data format.
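To illustrate what splitting along dimension 0 means, here is a plain numpy sketch; the shapes are made-up example values, and TPUEstimator’s infeed performs this sharding itself:

```python
import numpy as np

# What "split along dimension 0" means for NHWC data: the infeed shards
# images and labels along their batch dimension across the TPU cores.

images = np.zeros((128, 224, 224, 3))  # NHWC batch of 128 (example shape)
labels = np.zeros((128,))

num_cores = 8
image_shards = np.split(images, num_cores, axis=0)  # 8 shards of 16 images
label_shards = np.split(labels, num_cores, axis=0)

print(image_shards[0].shape)  # (16, 224, 224, 3)
print(label_shards[0].shape)  # (16,)
```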

2. model_fn and data_input_fn in TPUEstimator take more arguments than those of a regular tf.estimator.Estimator. We need to fetch some arguments (e.g. ‘batch_size’) from params.
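A minimal sketch of this convention with plain Python stand-ins (not a runnable TPU job; the function bodies are placeholders): TPUEstimator computes the per-shard batch size itself and injects it through params.

```python
# TPUEstimator hands input_fn/model_fn a `params` dict containing the
# per-core 'batch_size', so the functions must read it from there
# instead of hard-coding it. The bodies below are placeholders.

def input_fn(params):
    batch_size = params['batch_size']   # injected by TPUEstimator
    # ... here one would build a tf.data pipeline and batch it
    # with drop_remainder=True so shapes stay static ...
    return batch_size                   # returned only so the sketch is checkable

def model_fn(features, labels, mode, params):
    batch_size = params['batch_size']   # same convention in model_fn
    return batch_size

print(input_fn({'batch_size': 128}))                       # 128
print(model_fn(None, None, 'train', {'batch_size': 128}))  # 128
```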

3. The TPU doesn’t support operations like

so try to avoid using them.

4. Use tf.dataset carefully, or else it will report data-shape errors. The code below has run correctly so far:

5. Because we are using TPUEstimator, we can’t initialize the iterator of tf.dataset in ‘’, so a little trick is needed:

6. The Tensorflow in a GCP VM instance only supports loading datasets from, and storing models into, GCP Storage.

7. There aren’t any hooks for TPUEstimator in Tensorflow-1.9 yet, so I can’t see any reports in the console after launching a TPU program. I hope Google improves this as soon as possible.

Some modifications about SSD-Tensorflow

In the previous article, I introduced a new library for Object Detection. But yesterday, after I added slim.batch_norm() into ‘nets/’ like this:

Although training could still run correctly, the evaluation reported errors:

I wondered for quite a while why adding a simple batch_norm made a shape incorrect. Finally I found this page from Google. It says this type of error is usually caused by an incorrect data_format setting. Then I checked the code of ‘’ and ‘’, and got the answer: the training code uses ‘NCHW’ but the evaluation code uses ‘NHWC’!
After changing data_format to ‘NCHW’ in ‘’, the evaluation script ran successfully.
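As an illustration of why the mismatch breaks shapes, the same batch has completely different layouts under the two formats (a numpy sketch with made-up shapes):

```python
import numpy as np

# The same batch under NHWC vs NCHW: ops and weights built for one
# layout see the wrong dimensions under the other, hence shape errors.

batch_nhwc = np.zeros((8, 300, 300, 3))              # N, H, W, C
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))  # N, C, H, W

print(batch_nhwc.shape)  # (8, 300, 300, 3)
print(batch_nchw.shape)  # (8, 3, 300, 300)
```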

Choosing an Object Detection Framework written in Tensorflow

Recently I needed to train a DNN model for an object detection task. In the past, I used the object detection framework from Tensorflow’s ‘models’ project. But there are two reasons I couldn’t use it for my own project:

First, it’s too big. There are more than two hundred Python files in this subproject. It contains different types of models, such as Resnet and Mobilenet, and different types of detection algorithms, such as RCNN and SSD (Single Shot Detection). A large codebase is usually hard to understand and customize, and it could cost a lot of time to build a new project by referencing a big old one.
Second, it uses too many configuration files, so even trying to understand these configs would be tedious work.

Then I found a better project on GitHub: SSD-Tensorflow. It’s simple: it includes fewer than fifty Python files. You can begin training with just one command:

No configuration files, no ProtoBuf formats.
Although simple and easy to use, this project can still be extended. We can add a new dataset by changing the code in the ‘/datasets’ directory and add new networks in the ‘/nets’ directory. Let’s look at the basic implementation of its VGG for SSD:

Quite straightforward, isn’t it?

Finding core-dump file


On a new server, my program got a ‘core dump’. But I couldn’t find the core-dump file in the current directory as usual.
First I checked the ‘ulimit’ configuration:

It seemed OK: the system should generate a core-dump file when a program crashes. But where was it?
Eventually, I found the answer: the core-dump file is generated following the pattern written in /proc/sys/kernel/core_pattern.

Therefore all the core-dump files are located in the /var/coredump/ directory. The pattern settings of the ‘core_pattern’ file are explained here.
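As a small illustration, here is how such a pattern expands for the %e (executable name) and %p (PID) specifiers; the helper function is hypothetical and only mimics what the kernel does when writing the dump:

```python
# Toy expansion of a core_pattern template. The pattern string below is
# an example value, not a system default; the kernel itself performs
# this substitution when it writes the core file.

def expand_core_pattern(pattern: str, exe: str, pid: int) -> str:
    return (pattern
            .replace('%e', exe)        # %e -> executable name
            .replace('%p', str(pid)))  # %p -> process id

print(expand_core_pattern('/var/coredump/core.%e.%p', 'myprog', 1234))
# /var/coredump/core.myprog.1234
```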

Migrate blog to AWS’s ec2


My blog had been hosted on Linost since 2013. But recently, the support staff from Linost notified me that my site had pushed the host machine’s CPU usage to 100%, so the hosting system automatically ‘limited’ my resources, which actually means my site was totally shut down.
The first thing I wanted to do was log in to my host machine over SSH. But unfortunately, Linost doesn’t support SSH login. Without SSH and all the Linux commands, how could I find the cause of the high load and resolve it?
Finally, I chose AWS’s ec2 for my new hosting machine. To reduce the cost, I chose ‘t2.nano’, the cheapest instance type. Although it only has 512MB of memory, that’s adequate to run a basic WordPress blog. Additionally, I bought a reserved instance, paying upfront for a whole year, which decreases the cost further (about a 50% discount).
Using ec2 has another advantage: I don’t need to install MySQL/Apache/PHP/WordPress myself. With Jetware’s AMI (Amazon Machine Image), a basic WordPress blog can be launched with a few clicks. Jetware’s AMI uses LEMP (Linux/nginx web Engine/MySQL/PHP) as its basic software stack, and also includes phpMyAdmin for managing MySQL. This AMI is totally free. The only small defect is that the MySQL ‘root’ account is set up with an empty password. But we can fix it simply:

By typing ‘’ into the browser, I can manage MySQL very easily:

That’s awesome! Thanks to Jetware.

Source code analysis for Autograd

Autograd is a convenient tool to automatically differentiate native Python and Numpy code.

Let’s look at an example first:

The result is 3.2

f(x) = square(x) + 1, and its derivative is 2*x, so the result is correct.

The grad() function actually returns a ‘function object’, which is ‘grad_f’. When we call grad_f(1.6), it ‘traces’ f(x) by:

The ‘fun’ argument is our f(x) function.

In ‘trace()’, it actually calls f() not with ‘x’ but with an ArrayBox object. The ArrayBox object has two purposes:

1. Go through all the operations in f() in place of ‘x’, so it can get the real result of f(x)
2. Record the corresponding gradients of all the operations in f()

The ArrayBox class has already overridden all the basic arithmetic operations, such as add/subtract/multiply/divide/square. Therefore it can catch all the operations in f(x).

After catching all the operations, ArrayBox can look up the gradients table to get each corresponding gradient, and use the chain rule to compute the final gradient.
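To make the idea concrete, here is a toy sketch of tracing via operator overloading. It is not Autograd’s real ArrayBox (Autograd does reverse-mode with a gradients table); this toy accumulates the derivative forward for brevity:

```python
# A toy "box" that overloads arithmetic and applies the chain rule as
# operations happen. Supports just enough (box*box, box+const) to
# differentiate f(x) = x*x + 1.

class Box:
    def __init__(self, value, grad=1.0):
        self.value = value
        self.grad = grad              # d(box)/dx accumulated so far

    def __mul__(self, other):         # box * box
        out = Box(self.value * other.value)
        # product rule: d(uv) = u'v + uv'
        out.grad = self.grad * other.value + other.grad * self.value
        return out

    def __add__(self, const):         # box + constant
        out = Box(self.value + const)
        out.grad = self.grad          # constants contribute zero gradient
        return out

def f(x):
    return x * x + 1                  # f(x) = x^2 + 1

def grad_f(x):
    return f(Box(x)).grad             # trace f with a Box instead of a float

print(grad_f(1.6))  # 3.2
```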

The gradients table is shown below:

Besides this, Autograd has other tricks to complete its work. Take the function wrapper ‘@primitive’ as an example: this decorator lets users add new custom-defined operations to Autograd.

The source code of Autograd is nice and neat. Its examples include fully-connected networks, CNNs, and even RNNs. Let’s take a glimpse at Autograd’s implementation of the Adam optimizer to get a feel for its concise code style:

Prediction of Red Wine Quality

On the Kaggle platform, there is an example dataset about red wine quality. I wrote some code for it using scikit-learn and pandas:

The results reported by the snippet above:

It looks like the most important feature for predicting red wine quality is ‘alcohol’. Intuitive, right?
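The original snippet isn’t shown here, but one simple way to rank features, absolute correlation with the target, can be sketched with synthetic data standing in for the wine dataset:

```python
import numpy as np

# Synthetic stand-in for the red-wine data: 'quality' is driven by
# 'alcohol', so ranking by |correlation with quality| should put
# 'alcohol' first. (The real post used scikit-learn and pandas.)

rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10, 1, n)
ph = rng.normal(3.3, 0.15, n)
quality = 0.8 * alcohol + rng.normal(0, 0.5, n)

features = {'alcohol': alcohol, 'pH': ph}
scores = {name: abs(np.corrcoef(col, quality)[0, 1])
          for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # alcohol
```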

Use PCA (Principal Component Analysis) to blur a color image

I wrote an example of blurring a color picture using PCA from scikit-learn:

But it reported:

The correct solution is to reshape the image to two dimensions, and inverse-transform it back after PCA:
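The original scikit-learn code isn’t shown here, but the reshape-then-inverse-transform idea can be sketched with a plain numpy truncated SVD (the same math PCA relies on); the image below is synthetic random data standing in for a real photo:

```python
import numpy as np

# Blur by keeping only the top-k components: reshape (H, W, C) to 2-D,
# truncate the SVD, reconstruct, and reshape back.

image = np.random.rand(64, 64, 3)            # synthetic H, W, C image

h, w, c = image.shape
flat = image.reshape(h, w * c)               # 2-D: rows as "samples"

k = 8                                        # components to keep
u, s, vt = np.linalg.svd(flat, full_matrices=False)
blurred_flat = (u[:, :k] * s[:k]) @ vt[:k]   # "inverse transform"

blurred = blurred_flat.reshape(h, w, c)      # back to H, W, C
print(blurred.shape)  # (64, 64, 3)
```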

It works very well now. Let’s see the original image and the blurred image:

Original Image

Blurred Image

Do the tf.random_crop() operation on GPU

When I run code like:

it reports:

It looks like the tf.random_crop() operation doesn’t have a CUDA kernel implementation, so I needed to write one myself. The solution is surprisingly simple: write a function that does a random crop on one image using tf.random_uniform() and tf.slice(), then use tf.map_fn() to apply it to multiple images.
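The actual TensorFlow code isn’t shown here, but the approach can be sketched in numpy: crop one image at a random offset (standing in for tf.random_uniform plus tf.slice), then map that function over the batch (standing in for tf.map_fn):

```python
import numpy as np

# Random-crop one image, then apply it to every image in the batch.
# The shapes and crop sizes below are example values.

def random_crop_one(image, crop_h, crop_w, rng):
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_h + 1)    # random offset, like tf.random_uniform
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]  # like tf.slice

def random_crop_batch(images, crop_h, crop_w, seed=0):
    rng = np.random.default_rng(seed)
    # like tf.map_fn: apply the single-image crop to each image
    return np.stack([random_crop_one(img, crop_h, crop_w, rng)
                     for img in images])

batch = np.zeros((4, 32, 32, 3))
print(random_crop_batch(batch, 24, 24).shape)  # (4, 24, 24, 3)
```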

It can run on GPU now.