Comparing the implementation of tf.AdamOptimizer to its paper

When I reviewed the implementation of the Adam optimizer in TensorFlow yesterday, I noticed that its code is different from the formulas I saw in Adam's paper. TensorFlow's formulas for Adam are:

But the algorithm in the paper is:

Then I quickly found these words in the documentation of tf.AdamOptimizer:

Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

And this time I did find this ‘Algo 2’ (the formulation just before Section 2.1) in the paper:

But how does ‘Algo 1’ transform into ‘Algo 2’? Let me try to deduce it from ‘Algo 1’:

\theta_t \gets \theta_{t-1} - \frac{\alpha \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\implies   \theta_t \gets \theta_{t-1} - \alpha \cdot \frac{m_t}{1 - \beta_1^t} \cdot \frac{1}{\sqrt{\hat{v}_t} + \epsilon} \quad \text{(put } \hat{m}_t \text{ in)}
\implies   \theta_t \gets \theta_{t-1} - \alpha \cdot \frac{m_t}{1 - \beta_1^t} \cdot \frac{\sqrt{1-\beta_2^t}}{\sqrt{v_t}} \quad \text{(put } \hat{v}_t \text{ in and ignore } \epsilon \text{)}
\implies   \theta_t \gets \theta_{t-1} - \alpha_t \cdot \frac{m_t}{\sqrt{v_t} + \hat{\epsilon}} \quad \text{(let } \alpha_t = \alpha \cdot \tfrac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \text{ and add a new } \hat{\epsilon} \text{ to avoid dividing by zero)}
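To double check this equivalence, here is a small NumPy sketch I wrote (only my own verification on a toy scalar parameter, not TensorFlow's code); the two formulations give nearly identical results, differing only through where epsilon enters:

import numpy as np

np.random.seed(0)
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

def adam_algo1(grads):
    theta, m, v = 1.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)          # 'Algo 1' in the paper
    return theta

def adam_algo2(grads):
    theta, m, v = 1.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        alpha_t = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        theta -= alpha_t * m / (np.sqrt(v) + eps)                # 'Algo 2' with epsilon-hat
    return theta

grads = np.random.randn(100)   # a made-up gradient sequence for one scalar parameter
print(adam_algo1(grads), adam_algo2(grads))   # the two results are nearly identical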

A bug when using hooks with MirroredStrategy in tf.estimator.Estimator

When I was using MirroredStrategy in my tf.estimator.Estimator:

and added hooks for training:

TensorFlow reported errors:

Finding no answers on Google, I had to look into the code of ‘’ in TensorFlow. Fortunately, the code defect is obvious:

Class Estimator doesn't have any private attribute named ‘_distribution’; it only has ‘_train_distribution’ and ‘_eval_distribution’. So the fix is simply to change ‘self._distribution.unwrap(per_device_hook)[0]’ to ‘self._train_distribution.unwrap(per_device_hook)[0]’.
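For reference, the kind of setup that hits this code path looks roughly like the following (a simplified sketch with a placeholder model_fn, input_fn and hook, assuming TF 1.11-era APIs; it is not my exact code):

import tensorflow as tf

def model_fn(features, labels, mode):
    logits = tf.layers.dense(features["x"], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    features = {"x": tf.random_normal([4, 32])}
    labels = tf.zeros([4], tf.int64)
    return tf.data.Dataset.from_tensors((features, labels)).repeat()

strategy = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)

# passing any hook together with MirroredStrategy is what exercises the broken branch
estimator.train(input_fn=input_fn,
                hooks=[tf.train.StopAtStepHook(last_step=1000)])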

I have submitted a pull request to TensorFlow to fix this bug on branch 1.11.

Some lessons from a Kaggle competition

About two months ago, I joined the ‘RSNA Pneumonia Detection’ competition on Kaggle. It ended yesterday, but I still have many experiences and lessons to rethink.

1. Augmentation is extremely crucial. After using tf.image.sample_distorted_bounding_box() in my program, the mAP (mean Average Precision) on the evaluation dataset rose to a nearly perfect value (a sketch of this kind of augmentation follows this list). I then realised that I should have used a radical augmentation method in the first place. After all, for machine learning jobs such as image detection and image classification, a dataset of only tens of thousands of samples is quite small for extracting dense features, so we need a more powerful augmentation strategy or tool (albumentations may be a good choice).

2. SGD is good for generalisation. Previously I used Adam to get outstanding training accuracy, but I soon found this was useless since the evaluation accuracy was poor. Because the samples are so few, I can't use my evaluation dataset (a 10% cut from the original data) to correctly predict the score on the competition leaderboard. With no better choice, I had to train my model with plain SGD in the last stage.

3. Use more visual monitoring tools. At first my model had high training accuracy and low evaluation accuracy, but after I added too many regularisation methods (such as dropout and weight decay), both training and evaluation accuracy shrank to very low values. The key to regularising a DNN is: “keep the fitting capability on the training set, and then try to raise the evaluation accuracy”. If I could have monitored both training and evaluation accuracy in real time, I would not have been trapped in this dilemma.

4. Think more, experiment less. Spend more time understanding and checking the source code and the mechanism of the model, instead of just adjusting hyper-parameters in vain.
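For lesson 1, here is roughly how tf.image.sample_distorted_bounding_box() can be used for random cropping. It is only my own illustration: the crop parameters and the 512x512 resize are arbitrary choices, not the values from my competition pipeline.

import tensorflow as tf

def random_crop(image, bboxes):
    # bboxes: shape [1, N, 4], normalized [ymin, xmin, ymax, xmax]
    begin, size, _ = tf.image.sample_distorted_bounding_box(
        tf.shape(image),
        bounding_boxes=bboxes,
        min_object_covered=0.5,            # keep at least half of one ground-truth box
        aspect_ratio_range=(0.75, 1.33),
        area_range=(0.3, 1.0))
    cropped = tf.slice(image, begin, size)
    # in a real detection pipeline the ground-truth boxes must be re-clipped to the crop as well
    return tf.image.resize_images(cropped, [512, 512])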

There are still some questions I can’t answer at present:

1. Why can't even ResNet-50 raise my training mAP above 0.5?

2. Why does a near-perfect mAP on my evaluation dataset not translate into a good score on this competition's leaderboard?

I will keep digging into them.

How could it be possible to assign an integer to a string?

The snippet below could be compiled and run:

The result is:

I noticed that the value for the key ‘banana’ is empty. The reason is that I assigned an integer directly to the key ‘banana’ by mistake. But how could the C++ compiler allow me to do this? Why doesn't it report a compile error?
To reveal the truth, I wrote another snippet:

This code could also be compiled correctly!
Then I changed my code to:

This time, the compiler complained that

It seems std::string's constructor and assignment operator have totally different behaviour.
After checking the documentation, I found the reason: std::string has an assignment operator for ‘char’! (ref)
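A small illustration of the difference (my own example, not the snippets above): assignment goes through std::string::operator=(char), so the integer is silently narrowed to a char, while construction from an int has no matching constructor and fails.

#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<std::string, std::string> price;
    price["apple"] = "3";
    price["banana"] = 4;     // compiles: 4 is narrowed to char, then operator=(char) is chosen
    std::cout << "banana: [" << price["banana"] << "]" << std::endl;  // a single unprintable char, looks empty

    std::string s;
    s = 65;                  // also compiles; s becomes "A"
    std::cout << s << std::endl;

    // std::string t = 66;   // does NOT compile: no std::string constructor takes an int (or a single char)
    return 0;
}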

Thus we should be much more careful when assigning a number to a std::string.

Move semantics in C++11


After studying an example of move semantics in C++11, I wrote a more complete code snippet:

Pay attention to the last two lines in the ‘move constructor’:

Since the ‘move constructor’ does not give initial values to the m_size and m_data of ‘v4’, the m_size and m_data of ‘v3’ would be left uninitialized after the swaps. The two added lines initialize them, so the moved-from ‘v3’ ends up in a valid, empty state.
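Here is my own minimal version of the idea (not the original snippet): a swap-based move constructor has to put the new object's members into a defined state first, otherwise the moved-from object would hold garbage after the swaps.

#include <algorithm>
#include <cstddef>
#include <iostream>

class Vector {
public:
    explicit Vector(std::size_t size) : m_size(size), m_data(new int[size]()) {}
    ~Vector() { delete[] m_data; }

    Vector(const Vector&) = delete;
    Vector& operator=(const Vector&) = delete;

    // move constructor
    Vector(Vector&& other) noexcept
        : m_size(0), m_data(nullptr) {    // these initializers play the role of the two crucial lines:
        std::swap(m_size, other.m_size);  // after the swaps the moved-from object holds 0 / nullptr
        std::swap(m_data, other.m_data);  // and can be destroyed (or reused) safely
    }

    std::size_t size() const { return m_size; }

private:
    std::size_t m_size;
    int* m_data;
};

int main() {
    Vector v3(1000);
    Vector v4(std::move(v3));   // v4 takes the buffer; v3 is left in a valid, empty state
    std::cout << v4.size() << " " << v3.size() << std::endl;  // prints "1000 0"
    return 0;
}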

How does TensorFlow set the device for each Operation?

In TensorFlow, we only need the snippet below to assign a device to an Operation:
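The usual pattern is something like this (a generic illustration, not tied to any particular model):

import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0], name='a')
    b = a * 2.0                 # both 'a' and the multiply Operation are placed on the first GPU

print(a.device)                 # prints something like '/device:GPU:0'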

How is it implemented? Let's take a look.

There is a mechanism called a ‘context manager’ in Python. For example, we can use it to add a wrapper around a few lines of code:

The result of running this script is:

The function ‘tag()’ works like a decorator: it does something before and after the code lying inside its ‘context’.
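A minimal version of such a script could look like this (my own sketch; the original may differ):

from contextlib import contextmanager

@contextmanager
def tag(name):
    print("<%s>" % name)        # runs before the code under the 'with' block
    yield
    print("</%s>" % name)       # runs after the block finishes

with tag("h1"):
    print("hello, context manager")

# output:
# <h1>
# hello, context manager
# </h1>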

TensorFlow uses the same principle.

This eventually calls the ‘device()’ method of class Graph. Here is its implementation:

The key line is ‘self._add_device_to_stack()’. The ‘device’ context pushes the device name onto a stack, and when the developer creates an Operation, TensorFlow fetches the device name from this stack and sets it on the Operation.
Let's check the code path for creating an Operation:

‘self._device_function_stack.peek_objs’ is where it peeks the device names from the stack.

Some tips about using Google's TPU (Cont.)

Sometimes I get this error from TPUEstimator:

After stopping and restarting the TPU in the GCP console, the error disappeared. A TPU doesn't let users access it directly the way a GPU does: you can't see a device in the VM that looks like ‘/dev/tpu’ or anything similar. Google provides the TPU as an RPC service, so you can only run DNN training through that service. I think this RPC service is not stable enough, so sometimes it fails and leads to the ‘Deadline Exceeded’ error.

When I get this type of error from TPU:

The only solution is to create a new TPU instance and delete the old one in the GCP console. It seems Google needs to improve the robustness of their TPU RPC service.

Running 10000 steps and recording the ‘loss’ for each run:

It’s quite strange that the ‘loss’ can’t go low enough. I still need to do more experiments.

Previously, I ran MobileNet_v2 on a machine with a GeForce GTX 960, and it could process 100 samples per second. Using 8 TPUv2 cores, it can process about 500 samples per second. At first I was quite disappointed by the performance boost of TPUv2, since that works out to only about 1.4 TFLOPS per core. But then I realised that the bottleneck may not be the TPU itself, since IO is usually the limit on training speed. Besides, my model is MobileNet_v2, which is so simple and light that it can't exploit the full capability of the TPU.
Therefore I set ‘depth_multiplier=4’ for MobileNet_v2. With this model, the GTX 960 could process 21 samples per second, and the TPUv2-8 could process 275 samples per second. This time, I estimate each TPUv2 core reaches about 4 TFLOPS (the rough arithmetic is sketched below). I know this seems far below Google's official 45 TFLOPS, but considering the possible bottlenecks of storage IO and network bandwidth, it becomes understandable. There is also another possibility: Google's 45 TFLOPS may refer to half-precision performance 🙂
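The back-of-the-envelope arithmetic behind those estimates, assuming a GTX 960 delivers roughly 2.3 FP32 TFLOPS (that figure is my assumption, not something I measured here):

gtx960_tflops = 2.3                            # assumed FP32 peak of a GTX 960
# samples/s per TPUv2 core relative to the GTX 960, scaled by the GTX 960's TFLOPS
print(500.0 / 8 / 100 * gtx960_tflops)         # depth_multiplier=1: about 1.4 TFLOPS per core
print(275.0 / 8 / 21 * gtx960_tflops)          # depth_multiplier=4: about 3.8 TFLOPS per core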

Google has just released TensorFlow 1.11 for TPU clusters. At first, I thought I could now use hooks in TPUEstimatorSpec, but after adding

it reports

Certainly, the TPU is much harder to use and debug than GPU/CPU.

Some tips about using Google's TPU

About one month ago, I submitted a request to the Google Research Cloud to use TPUs for free. Fortunately, I received the approval yesterday. The approval lets me use 5 regular Cloud TPUs and 100 preemptible Cloud TPUs for free for 30 days, just by submitting my GCP project name.
Then I had to change my previous TensorFlow program to make it run on TPUs. I can't just change tf.device(‘/gpu:0’) to tf.device(‘/tpu:0’) in the code to run training on a Google TPU. Actually, there are many documents about how to modify code for this, such as TPUEstimator, Using TPUs, etc.

Here are some tips about porting code for TPUs:

1. We can only use TPUEstimator for training

Pay attention to ‘batch_axis’. It tells the TPU pod to split both data and labels along dimension 0, since I use the ‘NHWC’ data format.

2. The model_fn and data_input_fn of TPUEstimator take more arguments than those of a regular tf.estimator.Estimator. We need to fetch some arguments (‘batch_size’) from params, as in the sketch below.
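A hedged sketch of tips 1 and 2 together (the TPU name, bucket path and toy model below are placeholders of my own; the APIs are the tf.contrib.tpu ones from the TF 1.x era):

import tensorflow as tf

def my_model_fn(features, labels, mode, params):
    # tip 2: TPUEstimator passes an extra `params` dict; the per-shard batch size
    # shows up as params['batch_size'] (not needed in this toy model)
    logits = tf.layers.dense(tf.layers.flatten(features), 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.contrib.tpu.CrossShardOptimizer(
        tf.train.GradientDescentOptimizer(0.01))
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

def my_input_fn(params):
    batch_size = params['batch_size']            # injected by TPUEstimator (tip 2)
    images = tf.random_uniform([batch_size, 224, 224, 3])
    labels = tf.zeros([batch_size], tf.int64)
    return tf.data.Dataset.from_tensors((images, labels)).repeat()

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')   # made-up TPU name
run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my-bucket/model',            # made-up bucket
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100, num_shards=8))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=1024,
    batch_axis=(0, 0))      # tip 1: split features and labels along dimension 0 ('NHWC')

estimator.train(input_fn=my_input_fn, max_steps=1000)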

3. The TPU doesn't support operations like

So try to avoid using them

4. Use tf.data carefully, or else it will report data shape errors. The code I wrote runs correctly so far.
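I believe the essential requirement is that every tensor keeps a fully static shape. A generic input_fn in that style (the file path and feature names are made up, and drop_remainder needs TF 1.10 or later) would look roughly like this:

import tensorflow as tf

def input_fn(params):
    batch_size = params['batch_size']            # injected by TPUEstimator
    dataset = tf.data.TFRecordDataset('gs://my-bucket/train.tfrecord')   # made-up path

    def parse(record):
        parsed = tf.parse_single_example(record, {
            'image': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
        })
        image = tf.decode_raw(parsed['image'], tf.uint8)
        image = tf.reshape(image, [224, 224, 3])   # give the image a fully static shape
        image = tf.cast(image, tf.float32) / 255.0
        return image, parsed['label']

    dataset = dataset.map(parse).repeat().shuffle(1024)
    # drop_remainder=True keeps every batch exactly batch_size, so the batch dimension stays static too
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset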

5. Because we are using TPUEstimator, we can't initialize the iterator of tf.data in ‘’, so a little trick is needed:

6. The TensorFlow in a GCP VM instance only supports loading datasets from, and storing models into, GCP Storage.

7. There aren't any hooks for TPUEstimator currently in TensorFlow 1.9, so I can't see any report in the console after launching a TPU program. I hope Google improves this as soon as possible.

Some modifications to SSD-Tensorflow

In the previous article, I introduced a new library for Object Detection. But yesterday, after I added slim.batch_norm() into ‘nets/’ like this:

Although training could still run correctly, the evaluation reported errors:

For quite a while I wondered why adding a simple batch_norm would make the shapes incorrect. Finally I found this page through Google. It says this type of error is usually caused by an incorrect data_format setting. Then I checked the code of ‘’ and ‘’, and got the answer: the training code uses ‘NCHW’ but the evaluation code uses ‘NHWC’!
After changing the data_format to ‘NCHW’ in ‘’, the evaluation script ran successfully.
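As I understand it, the essence of the fix is simply to pass the same data_format everywhere. A rough sketch of that idea (my own illustration, not the actual SSD-Tensorflow code):

import tensorflow as tf
slim = tf.contrib.slim

def conv_block(net, data_format='NCHW', is_training=True):
    # the conv layers and batch_norm must agree on data_format,
    # otherwise one of the scripts fails with shape errors like the one above
    with slim.arg_scope([slim.conv2d],
                        data_format=data_format,
                        normalizer_fn=slim.batch_norm,
                        normalizer_params={'data_format': data_format,
                                           'is_training': is_training}):
        return slim.conv2d(net, 64, [3, 3], scope='conv1')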

Choosing an Object Detection Framework written in TensorFlow

Recently I needed to train a DNN model for an object detection task. In the past, I used the object detection framework from TensorFlow's ‘models’ project. But there are two reasons I couldn't use it for my own project:

First, it's too big. There are more than two hundred Python files in this subproject. It contains different types of models, such as ResNet and MobileNet, and different types of detection algorithms, such as R-CNN and SSD (Single Shot Detector). An enormous codebase usually means it is hard to understand and customize, and it could cost a lot of time to build a new project by referencing such a big, old project.
Second, it uses too many configuration files, so even trying to understand these configs would be tedious work.

Then I found a better project on GitHub: SSD-Tensorflow. It's simple: it includes fewer than fifty Python files, and you can begin training with just one command:

No configuration files, no ProtoBuf formats.
Although simple and easy to use, this project can still be remolded. We can add new datasets by changing the code in the ‘/datasets’ directory and add new networks in the ‘/nets’ directory. Let's look at the basic implementation of its VGG backbone for SSD:

Quite straightforward, isn't it?