RobinDong

An example of running an operation before fetching data in Tensorflow

In Tensorflow, what should we do if we want to run something before fetching data (for example, when using a queue)? Here is an example I have tested myself:

import tensorflow as tf
from tensorflow.python.framework import ops
def my_func(x):
    print("hello")
    return x
queue = tf.FIFOQueue(10, "float32")
init  = queue.enqueue_many(([1, 2, 3],))
my_op = tf.py_func(my_func, [1.0], tf.float32)
with ops.control_dependencies([my_op]):
    inc = queue.enqueue([1],)
with tf.Session() as sess:
    sess.run(init)
    print("Before Enqueue")
    sess.run(inc)
    print("After Enqueue")

It will print

Before Enqueue
hello
After Enqueue

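Because ‘inc’ is created inside control_dependencies([my_op]), my_op runs first, so we have successfully added an operation that runs before an item is enqueued into the queue.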

Why doesn’t my model converge?

To train Resnet-50 on the CIFAR100 dataset, I wrote a program with Tensorflow. But when I ran it, the loss seemed to stay at about 4.5~4.6 forever:

step: 199, loss: 4.61291, accuracy: 0
step: 200, loss: 4.60952, accuracy: 0
step: 201, loss: 4.60763, accuracy: 0
step: 202, loss: 4.62495, accuracy: 0
step: 203, loss: 4.62312, accuracy: 0
step: 204, loss: 4.60703, accuracy: 0
step: 205, loss: 4.60947, accuracy: 0
step: 206, loss: 4.59816, accuracy: 0
step: 207, loss: 4.62643, accuracy: 0
step: 208, loss: 4.59422, accuracy: 0
...
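One way to read these numbers, sketched below under the assumption that the loss is the usual softmax cross-entropy: a classifier that gives all 100 classes the same probability has a loss of ln(100), which is very close to the values in the log. So a loss stuck around 4.6 suggests the network is doing no better than uniform guessing. The function name is only for illustration.

```python
import math

NUM_CLASSES = 100  # CIFAR100 has 100 fine labels


def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a prediction that gives every class
    the same probability 1/num_classes."""
    return -math.log(1.0 / num_classes)


chance_loss = uniform_cross_entropy(NUM_CLASSES)
print("loss at chance level: %.5f" % chance_loss)
```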

After changing the model (from Resnet to a fully-connected net), the optimizer (from AdamOptimizer to AdagradOptimizer), and even the learning rate (from 1e-3 down to 1e-7), the phenomenon didn’t change at all.
Finally, I checked the loss and the output vector step by step, and found that the problem was not in the model but in the dataset code:

    def next_batch(self, batch_size = 64):
        images = []
        labels = []
        for i in range(self.pos, self.pos + batch_size):
            image = self.data['data'][self.pos]
            image = image.reshape(3, 32, 32)
            image = image.transpose(1, 2, 0)
            image = image.astype(np.float32) / 255.0
            images.append(image)
            label = self.data['fine_labels'][self.pos]
            labels.append(label)
        if (self.pos + batch_size) >= CIFAR100_TRAIN_SAMPLES:
            self.pos = 0
        else:
            self.pos = self.pos + batch_size
        return [images, labels]

Within each batch, every sample is the same picture with the same label! That’s why the model didn’t converge. I should have used ‘i’ instead of ‘self.pos’ as the index to fetch data and labels.
So in the Deep Learning area, problems come not only from models and hyper-parameters, but also from the dataset, or from faulty code…
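A minimal sketch of the corrected loop, using plain Python lists instead of the real CIFAR100 arrays so it can run anywhere. The class name and the toy data are hypothetical, and the image reshaping is left out for brevity. Note one extra assumption: once ‘i’ is used as the index, the last batch of an epoch has to be clamped to the number of samples, otherwise it would read past the end of the data.

```python
class Dataset(object):
    """Hypothetical stand-in for the real loader; self.data mimics its layout."""

    def __init__(self, data):
        self.data = data
        self.pos = 0
        self.num_samples = len(data['data'])

    def next_batch(self, batch_size=64):
        # Clamp the end of the batch so the last batch cannot run off the data.
        end = min(self.pos + batch_size, self.num_samples)
        images = []
        labels = []
        for i in range(self.pos, end):
            images.append(self.data['data'][i])         # index with i, not self.pos
            labels.append(self.data['fine_labels'][i])
        # Wrap around to the beginning once the whole set has been consumed.
        self.pos = 0 if end >= self.num_samples else end
        return [images, labels]


toy = Dataset({'data': list(range(10)),
               'fine_labels': [n % 3 for n in range(10)]})
first = toy.next_batch(4)
second = toy.next_batch(4)
third = toy.next_batch(4)
print(first, second, third)
```

A cheap check like counting the distinct samples in one batch would have caught the original bug early.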

Problem about using slim.batch_norm() of Tensorflow (second episode)

In the previous article, I found out the reason. But how to resolve it for Multi-GPU training was still a question. Following the suggestion in this issue on Github, I tried two ways to fix the problem:
First, I rewrote my Averaging-Gradients-Training to imitate tf.slim.create_train_op():

...
def create_train_grads(total_loss, optimizer):
  update_ops = set(ops.get_collection(ops.GraphKeys.UPDATE_OPS))
  with ops.control_dependencies(update_ops):
    barrier = control_flow_ops.no_op(name='update_barrier')
  total_loss = control_flow_ops.with_dependencies([barrier], total_loss)
  variables_to_train = tf_variables.trainable_variables()
  grads = optimizer.compute_gradients(total_loss, variables_to_train)
  return grads
...
          cross_entropy = tf.reduce_mean(cross_entropy)
          tf.get_variable_scope().reuse_variables()
          grads = create_train_grads(cross_entropy, opt)
          tower_grads.append(grads)
...
  grads = average_gradients(tower_grads)
  grad_updates = opt.apply_gradients(grads)
  with ops.name_scope('train_op'):
    # Ensure the train_tensor computes grad_updates.
    train_op = control_flow_ops.with_dependencies([grad_updates], cross_entropy)
  # Add the operation used for training to the 'train_op' collection
  train_ops = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
  if train_op not in train_ops:
    train_ops.append(train_op)

But unfortunately, this didn’t work at all. The inference result was still a mess.
Then I tried another way: Asynchronous-Gradient-Training with tf.slim.create_train_op():

...
          cross_entropy = tf.reduce_mean(cross_entropy)
          train_op = tf.contrib.slim.learning.create_train_op(cross_entropy, opt)
          tower_ops.append(train_op)
...
  train_step = tf.group(*tower_ops)

Now the inference works very well! And the training speed becomes a little faster than with Averaging-Gradients-Training, because the averaging operation has to wait for the gradients from all the GPUs.
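The average_gradients() helper is not shown in the snippets above. As a rough illustration of why it acts as a synchronization point, here is a plain-Python sketch that uses floats in place of tensors and variable names in place of variables; it is an assumption about the general shape of such a helper, not the code I actually ran.

```python
def average_gradients(tower_grads):
    """tower_grads holds one list per GPU of (gradient, variable_name) pairs,
    with the variables in the same order on every GPU."""
    averaged = []
    # zip(*...) groups the entries that belong to the same variable, so a
    # result is only available once every tower has delivered its gradient.
    for grads_and_vars in zip(*tower_grads):
        grads = [grad for grad, _ in grads_and_vars]
        var = grads_and_vars[0][1]
        averaged.append((sum(grads) / float(len(grads)), var))
    return averaged


tower_grads = [
    [(0.2, 'w'), (1.0, 'b')],  # gradients computed on GPU 0
    [(0.4, 'w'), (3.0, 'b')],  # gradients computed on GPU 1
]
avg = average_gradients(tower_grads)
print(avg)
```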

Problem about using slim.batch_norm() of Tensorflow

After using resnet_v2_50 from tensorflow/models, I found that the inference result was totally incorrect, though the training accuracy looked very good.
Firstly, I suspected the regularization of samples:

  image = tf.image.resize_image_with_crop_or_pad(image, IMAGE_HEIGHT + 66, IMAGE_WIDTH + 66)
  image = tf.random_crop(image, [IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS])
  image = tf.image.random_flip_left_right(image)

Indeed, I had padded the image to too big a size. But after I changed the padding size to ’10’, the inference accuracy was still incorrect.
Then I checked the code for importing data:

# To avoid various formats of picture, I encode all image to 'jpeg' and write them as TFRecord
img = cv2.imread(file_name)
raw_image = cv2.imencode('.jpeg', img)[1].tostring()
....
# When importing data from TFRecord
image = tf.image.decode_image(image)

and changed my inference code to follow the same data importing routines. But the problem still existed.
About one week passed. Finally, I found this issue on Github. It answers all my questions: the cause is slim.batch_norm(). After I added this code to my program (learned from slim.create_train_op()):

update_ops = set(ops.get_collection(ops.GraphKeys.UPDATE_OPS))
with ops.control_dependencies(update_ops):
  barrier = control_flow_ops.no_op(name='update_barrier')
total_loss = control_flow_ops.with_dependencies([barrier], total_loss)
grads = optimizer.compute_gradients(total_loss)
...

The inference accuracy was still low. With no other choice, I removed all slim.batch_norm() calls in resnet_v2.py, and this time the inference accuracy became the same as the training accuracy.
It looks like the problem has been partly solved, but I still need to find out why slim.batch_norm() doesn’t work well in inference …
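A commonly described mechanism behind this kind of symptom (not verified here as the full explanation) is that batch normalization uses the statistics of the current batch during training but moving averages at inference, and the moving averages only change when the ops in UPDATE_OPS actually run. The toy class below is a plain-Python sketch of that idea for a single scalar feature; the class and its parameters are hypothetical and have nothing to do with the real slim implementation.

```python
class ToyBatchNorm(object):
    """Scalar batch norm: batch statistics in training, moving averages at inference."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.moving_mean = 0.0  # usual initial value
        self.moving_var = 1.0   # usual initial value

    @staticmethod
    def _stats(batch):
        mean = sum(batch) / float(len(batch))
        var = sum((x - mean) ** 2 for x in batch) / float(len(batch))
        return mean, var

    def train_step(self, batch, run_update_ops):
        mean, var = self._stats(batch)
        if run_update_ops:
            # This is the part that is skipped when the update ops never run.
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_var = self.decay * self.moving_var + (1 - self.decay) * var
        return [(x - mean) / (var + 1e-5) ** 0.5 for x in batch]

    def infer(self, x):
        return (x - self.moving_mean) / (self.moving_var + 1e-5) ** 0.5


batch = [98.0, 100.0, 102.0]
with_updates = ToyBatchNorm()
without_updates = ToyBatchNorm()
for _ in range(200):
    with_updates.train_step(batch, run_update_ops=True)
    without_updates.train_step(batch, run_update_ops=False)

# Training output is identical in both cases; only inference differs.
print(with_updates.infer(100.0))
print(without_updates.infer(100.0))
```

In the second case the training output looks perfectly normal, while inference normalizes with the untouched initial statistics and produces values far outside the range seen in training.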

Experiment for distributed Tensorflow

Here is my experimental code for distributed Tensorflow, which I adapted from the example.

import tensorflow as tf
import argparse
import time
FLAGS = None
def main():
    print(tf.__version__)
    cluster_spec = tf.train.ClusterSpec({
        'worker': ['localhost:1829'],
        'ps': ['localhost:1057'],
        })
    if FLAGS.ps:
        server = tf.train.Server(cluster_spec, job_name = 'ps', task_index = 0)
        server.join()
    else:
        server = tf.train.Server(cluster_spec, job_name = 'worker', task_index = FLAGS.worker)
        print(server.target)
        with tf.device('/job:ps/task:0'):
            init = tf.constant_initializer([0])
            c = tf.get_variable('myc', shape = [], initializer = init)
        res = tf.add(c, 1)
        train_op = tf.assign(c, res)
        with tf.Session(target = server.target) as sess:
            c.initializer.run()
            while True:
                res = sess.run(train_op)
                print(res)
                time.sleep(1)
...

The important thing is that we need to use tf.assign() to push the Variable back to the Parameter Server. In this example the operation ‘tf.add’ runs on task 0 of the worker. But if we deploy a more complicated application with many tasks, things become weird: a pipeline operation sometimes even runs on the ‘ps’ role! The official solution to this problem is ‘tf.train.replica_device_setter()’, which automatically places Variables on parameter servers and Operations (many replicas) on workers. What does ‘tf.train.replica_device_setter()’ do? Let’s look at the backbone of its implementation:

def replica_device_setter(ps_tasks=0, ps_device="/job:ps",
                          worker_device="/job:worker", merge_devices=True,
                          cluster=None, ps_ops=None, ps_strategy=None):
...
  if ps_ops is None:
    # TODO(sherrym): Variables in the LOCAL_VARIABLES collection should not be
    # placed in the parameter server.
    ps_ops = ["Variable", "VariableV2", "VarHandleOp"]
  if not merge_devices:
    logging.warning(
        "DEPRECATION: It is recommended to set merge_devices=true in "
        "replica_device_setter")
  if ps_strategy is None:
    ps_strategy = _RoundRobinStrategy(ps_tasks)
  if not six.callable(ps_strategy):
    raise TypeError("ps_strategy must be callable")
  chooser = _ReplicaDeviceChooser(
      ps_tasks, ps_device, worker_device, merge_devices, ps_ops, ps_strategy)
  return chooser.device_function

All the Variables are counted as ‘ps_ops’, and the placement strategy for Operations is replication, as the name ‘_ReplicaDeviceChooser’ suggests.

def device_function(self, op):
...
    node_def = op if isinstance(op, node_def_pb2.NodeDef) else op.node_def
    if self._ps_tasks and self._ps_device and node_def.op in self._ps_ops:
      ps_device = pydev.DeviceSpec.from_string(self._ps_device)
      current_job, ps_job = current_device.job, ps_device.job
      if ps_job and (not current_job or current_job == ps_job):
        ps_device.task = self._ps_strategy(op)
      ps_device.merge_from(current_device)
      return ps_device.to_string()
    worker_device = pydev.DeviceSpec.from_string(self._worker_device or "")
    worker_device.merge_from(current_device)
    return worker_device.to_string()

Every ‘op’ whose type is in ‘self._ps_ops’ is placed on ‘ps_device’; everything else goes to ‘worker_device’.
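To make the placement rule concrete, here is a toy re-implementation in plain Python. It only mimics the two ideas visible in the excerpts (variable-like ops go to parameter servers in round-robin order, everything else goes to the worker); the class name, the fixed worker device string, and the use of op type strings instead of real ops are simplifications of mine.

```python
PS_OPS = ("Variable", "VariableV2", "VarHandleOp")  # default list from the excerpt


class ToyReplicaDeviceChooser(object):
    """Simplified placement: variable ops to ps tasks (round robin), the rest to the worker."""

    def __init__(self, ps_tasks, worker_device="/job:worker/task:0"):
        self._ps_tasks = ps_tasks
        self._worker_device = worker_device
        self._next_task = 0

    def device_function(self, op_type):
        if self._ps_tasks and op_type in PS_OPS:
            task = self._next_task
            # Round robin over the available parameter server tasks.
            self._next_task = (self._next_task + 1) % self._ps_tasks
            return "/job:ps/task:%d" % task
        return self._worker_device


chooser = ToyReplicaDeviceChooser(ps_tasks=2)
placements = [(op, chooser.device_function(op))
              for op in ["VariableV2", "VariableV2", "Add", "VariableV2", "MatMul"]]
for op, device in placements:
    print("%s -> %s" % (op, device))
```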

Performance problem for training images on MXNet

After running my MXNet application like this snippet:

MXNET_CUDNN_AUTOTUNE_DEFAULT=0 python bird.py train 0.8 &> log

I found that the training speed was only 300 samples per second, and the GPU usage looked very strange:

# nvidia-smi -l |grep Default
| N/A   48C    P0   184W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   44C    P0   145W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   47C    P0    44W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   43C    P0    40W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   45C    P0   182W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0   182W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   44C    P0    42W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0    41W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |     21%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     17%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |     48%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     44%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |     41%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     36%      Default |
| N/A   45C    P0    43W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     98%      Default |

About two days later, I noticed some messages reported by MXNet:

INFO:root:Using 1 threads for decoding...
INFO:root:Set enviroment variable MXNET_CPU_WORKER_NTHREADS to a larger number to use more threads.

After changing my command to:

MXNET_CPU_WORKER_NTHREADS=16 MXNET_CUDNN_AUTOTUNE_DEFAULT=0 python bird.py train 0.8 &> log

the training speed rose to 690 samples per second, and the GPU usage became much smoother since more CPU threads could now be used to decode images:

# nvidia-smi -l |grep Default
| N/A   40C    P0   173W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   45C    P0   182W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   42C    P0   183W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   46C    P0   163W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   43C    P0   153W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   48C    P0   181W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   44C    P0   168W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   49C    P0   190W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   45C    P0   181W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   50C    P0   136W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   46C    P0   138W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   51C    P0   186W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   47C    P0   161W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   52C    P0   212W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   48C    P0   192W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   52C    P0   155W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   48C    P0   152W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   54C    P0   180W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   49C    P0   166W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   54C    P0   194W / 250W |   4971MiB / 16276MiB |     98%      Default |

The problem of ‘bool’ type in argparse of Python 2.7

To learn the example of distributed Tensorflow, I wrote this snippet:

import argparse
FLAGS = None
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.register("type", "bool", lambda v: v.lower() == "true")
    parser.add_argument(
        "--training",
        type = bool,
        default = True,
    )
    FLAGS, unparsed = parser.parse_known_args()
    print(FLAGS)

The “parser.register()” call is the Tensorflow way of registering a ‘bool’ type for the parser. But it doesn’t work! In my shell, I ran

python test.py --training false
python test.py --training False

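Both print “Namespace(training=True)”, which means the code above can’t change the value of the argument ‘training’ (my Python version is 2.7.5). The culprit is ‘type = bool’: it passes the builtin bool instead of the string "bool" under which the converter was registered, and bool() of any non-empty string, including "false", is True.
The correct code should be: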

def str2bool(value):
    return value.lower() == 'true'
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--training",
        type = str2bool,
        default = True,
    )
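The three variants can be compared side by side with a small script. The helper parse_training() is only for this illustration; it feeds a fixed argument list to the parser instead of reading the command line.

```python
import argparse


def str2bool(value):
    return value.lower() == 'true'


def parse_training(argv, arg_type):
    """Parse --training with the given 'type' argument and return its value."""
    parser = argparse.ArgumentParser()
    parser.register("type", "bool", str2bool)
    parser.add_argument("--training", type=arg_type, default=True)
    flags, _ = parser.parse_known_args(argv)
    return flags.training


argv = ["--training", "false"]
print(parse_training(argv, bool))      # builtin bool: any non-empty string is truthy
print(parse_training(argv, "bool"))    # string key: uses the registered converter
print(parse_training(argv, str2bool))  # explicit converter function
```

Passing the converter function directly, as in the corrected code, avoids the registry lookup altogether, so there is no way to confuse the builtin with the registered name.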

Using Python to access HBase through JPype

First, we need to write a Java function to get data from HBase:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.DiamondAddressHelper;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Result;
public class HBaseReader {
  private HTableInterface table_;
  private HTablePool pool_;
  public ArrayList<byte[]> fetchFeatures(String columnFamily, String qualifier, ArrayList<String> feature_ids) {
    List<Row> batch = new ArrayList<Row>();
    for (String id : feature_ids) {
      Get get = new Get(id.getBytes());
      get.addColumn(columnFamily.getBytes(), qualifier.getBytes());
      batch.add(get);
    }
    Object[] results = new Object[batch.size()];
    try {
      table_.batch(batch, results);
    } catch (Exception e) {
      System.err.println("Error: " + e);
    }
    ArrayList<byte[]> list = new ArrayList<byte[]>();
    for (Object obj : results) {
      if (obj instanceof Result) {
        Result res = (Result)obj;
        byte[] value = res.getValue(columnFamily.getBytes(), qualifier.getBytes());
        if (value == null) {
          list.add("".getBytes());
        } else {
          list.add(value);
        }
      }
    }
    return list;
  }
  public void init(String dataid, String groupid, String tableName) {
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean(DiamondAddressHelper.DIMAOND_HBASE_UNITIZED, true);
    conf.set(DiamondAddressHelper.DIAMOND_HBASE_KEY_NEW, dataid);
    conf.set(DiamondAddressHelper.DIAMOND_HBASE_GROUP, groupid);
    try {
      pool_ = new HTablePool(conf, 100);
      table_ = pool_.getTable(tableName);
    } catch (Exception e) {
      System.err.println("Error: " + e);
    }
  }
  public void shutdown() {
    try {
      table_.close();
      pool_.close();
    } catch (Exception e) {
      System.err.println("Error: " + e);
    }
  }
}

Then use maven to build it into one jar file with all dependent libraries:


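<project>
  ....
  ....
  <artifactId>hbasereader</artifactId>
  <packaging>jar</packaging>
  <version>1.0.0</version>
  <name>Reader for HBase</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      ....
      <exclusions>
        <exclusion>
          <groupId>jdk.tools</groupId>
          <artifactId>jdk.tools</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      ....
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop.thirdparty.guava</groupId>
      <artifactId>guava</artifactId>
      ....
    </dependency>
  </dependencies>
....
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.6</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass>com.taobao.ad.HBaseReader</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>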

Now we can use Python to call this Java class through JPype:

import os
import time
import jpype
import numpy
jpype.startJVM(jpype.getDefaultJVMPath(), '-ea', '-Djava.class.path=./target/hbasereader-1.0.0-jar-with-dependencies.jar')
Reader = jpype.JClass("com.taobao.ad.HBaseReader")
reader = Reader()
ArrayList = jpype.JClass("java.util.ArrayList")
list = ArrayList()
list.add('1')
list.add('2')
list.add('3')
list.add('4')
list.add('5')
reader.init('hbase.diamond.dataid.test.hbase', 'hbase-diamond-group-name-test', 'alimama_training_image_table')
begin = time.time()
for i in range(10000):
  res = reader.fetchFeatures('ct', 'image', list)
period = time.time() - begin
print(period)
print(10000/period)
data = numpy.asarray(res[4], dtype=numpy.uint8)
print(data)
reader.shutdown()
jpype.shutdownJVM()

This Python example runs correctly. But if we use it inside tf.py_func(), it core dumps in libjvm.so, which is difficult to debug. So in the end we chose to write a custom operation in C++ that accesses HBase through a Thrift server, which is more stable and gives a cleaner architecture.

Small tips about containers in Intel Threading Building Blocks and C++11

Changing values in container

std::map<std::string, int> table;  // key/value types assumed; the original template arguments were lost
for (auto item : table) {
  item.second = 2;
}

The code above will not change any value in the container ‘table’. ‘auto’ is deduced as std::pair<const Key, Value>, so ‘item’ is a copy of the real element in ‘table’, and modifying ‘item’ does not change the actual value in the container.
The correct way is:

for (auto &item : table) {
  item.second = 2;
}

or:

for (std::map<std::string, int>::iterator it = table.begin(); it != table.end(); ++it) {
  it->second = 2;
}

Do traversal and modification concurrently in a container
Using concurrent_hash_map like this:

typedef tbb::concurrent_hash_map<std::string, int> CacheMap;
CacheMap cache;
....
// Thread 1
for (auto &item : cache) {
  cout << item.second << "\n";
}
// Thread 2
CacheMap::accessor ac;
cache.insert(ac, std::make_pair("hello", 123));
....

will cause the program to dump core.
The reason is that a concurrent_hash_map can't be modified and traversed concurrently.
Actually, Intel provides another container for concurrent traversal and insertion: concurrent_unordered_map.
But be careful: concurrent_unordered_map supports simultaneous traversal and insertion, but not simultaneous traversal and erasure.

The CSE (Common Subexpression Elimination) problem about running custom operation in Tensorflow

Recently, we created a new custom operation in Tensorflow:

REGISTER_OP("GetImageID")
    .Input("count: int32")
    .Output("image_id: string");
using namespace tensorflow;
using namespace std;
class GetImageIDOp : public OpKernel {
 public:
  explicit GetImageIDOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
  }
  void Compute(OpKernelContext* ctx) override {
    const Tensor* cnt;
    OP_REQUIRES_OK(ctx, ctx->input("count", &cnt));  // This is how we should get 'input' of Op
    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(cnt->shape()),
                errors::InvalidArgument("cnt is not a scalar: ", cnt->shape().DebugString()));
    int32 count = cnt->scalar<int32>()();
    vector<string> image_ids;
    int num = get_image_ids(&image_ids);  // Get image_ids from somewhere, such as network, or disk
    Tensor* image_id;
    OP_REQUIRES_OK(ctx, ctx->allocate_output("image_id", TensorShape({static_cast<int64>(num)}), &image_id));
    auto imageid_flat = image_id->flat<string>();
    for (int i = 0; i < num; i++) {
      imageid_flat(i) = image_ids[i];
    }
  }
 private:
};
REGISTER_KERNEL_BUILDER(Name("GetImageID").Device(DEVICE_CPU), GetImageIDOp);

It's as simple as the example in Tensorflow's documentation. But when we run this Op in a session:

get_image_id_op = get_image_id(32)
with tf.Session() as sess:
  while (True):
    sess.run(get_image_id_op)

It only gets image_ids from the network once, and then reuses the result of the first 'run' forever, without ever calling the 'Compute()' function in the cpp code again!
It seems Tensorflow optimized the new Op and never runs it twice. My colleague suggested working around this with tf.placeholder:

counter = tf.placeholder(tf.int32)
get_image_id_op = get_image_id(counter)
with tf.Session() as sess:
  while (True):
    sess.run(get_image_id_op, feed_dict = {counter: count})

This looks a little tricky. The final solution is to add a flag in the cpp code so that the new Op is excluded from CSE (Common Subexpression Elimination):

REGISTER_OP("GetImageID")
    .SetIsStateful()
    .Input("count: int32")
    .Output("image_id: string");
......
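The effect of the stateful flag can be pictured with a toy model in plain Python. This is not how Tensorflow's optimizer is implemented; it only illustrates the idea that the result of a stateless op with unchanged inputs may be reused, while a stateful op has to be executed on every run. ToyGraph, get_image_ids() and the returned strings are all made up for the illustration.

```python
class ToyGraph(object):
    """Reuses results of stateless ops with identical inputs; always re-runs stateful ops."""

    def __init__(self):
        self._cache = {}

    def run(self, op_name, func, const_input, stateful=False):
        key = (op_name, const_input)
        if not stateful and key in self._cache:
            return self._cache[key]  # the op function is not called again
        result = func(const_input)
        if not stateful:
            self._cache[key] = result
        return result


calls = []  # records every real execution of the op


def get_image_ids(count):
    calls.append(count)
    return "batch_%d" % len(calls)


graph = ToyGraph()
stateless = [graph.run("GetImageID", get_image_ids, 32) for _ in range(3)]
calls_after_stateless = len(calls)
stateful = [graph.run("GetImageID", get_image_ids, 32, stateful=True) for _ in range(3)]
print(stateless)
print(stateful)
```

The placeholder workaround fits the same picture: feeding the input at run time means it is no longer a constant that the optimizer can reason about ahead of time.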

Attachment of the 'CMakeLists.txt':

cmake_minimum_required(VERSION 2.8)
project(my_proj)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC -std=c++11 -O2 -g2")
#set(CMAKE_MACOSX_RPATH 0)
set(CMAKE_SKIP_RPATH TRUE)
if (APPLE)
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -undefined dynamic_lookup")
elseif (UNIX)
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
endif()
set(CMAKE_SHARED_LIBRARY_SUFFIX ".so")
execute_process(COMMAND python -c "import tensorflow as tf; print(tf.sysconfig.get_include())" OUTPUT_VARIABLE tf_inc)
include_directories(${tf_inc} "my/")
link_directories("/usr/lib64/" "${CMAKE_CURRENT_SOURCE_DIR}/tensorflow-core/third_party/erpc_lib/lib/")
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/lib/)
add_library(my_op SHARED my_op.cc)
target_link_libraries(my_op tbb protobuf)