tensorflow

Do tf.random_crop() operation on GPU

When I run code like:

with tf.device('/GPU:0'):
  images = tf.random_crop(images, [IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS])
...

it reports:

Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

It looks like the tf.random_crop() operation doesn’t have a CUDA kernel implementation, so I need to write it myself. The solution is surprisingly simple: write a function that does a random crop on one image by using tf.random_uniform() and tf.slice(), and then use tf.map_fn() to apply it to a batch of images.

def my_random_crop(value, size):
    shape = tf.shape(value)
    size = tf.convert_to_tensor(size, dtype = tf.int32)
    limit = shape - size + 1
    offset = tf.random_uniform(tf.shape(shape), dtype = size.dtype, maxval = size.dtype.max) % limit
    return tf.slice(value, offset, size)
...
images = tf.map_fn(lambda img: my_random_crop(img, [IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS]), images)

It can run on GPU now.
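
To see why `limit = shape - size + 1` keeps the crop inside the image, here is a minimal plain-Python sketch of the same offset arithmetic. It does not use Tensorflow at all; `crop_offsets` and `crop_2d` are hypothetical helper names used only for illustration.

```python
import random

def crop_offsets(shape, size, rng=random):
    """Pick a random start index per dimension so the crop fits inside shape."""
    offsets = []
    for dim, want in zip(shape, size):
        if want > dim:
            raise ValueError("crop size %d exceeds dimension %d" % (want, dim))
        limit = dim - want + 1          # number of valid start positions
        # Same idea as the TF version: a large random integer modulo limit.
        offsets.append(rng.randrange(2 ** 31 - 1) % limit)
    return offsets

def crop_2d(rows, offset, size):
    """Slice a list-of-lists image with the chosen offset and size."""
    top, left = offset
    height, width = size
    return [row[left:left + width] for row in rows[top:top + height]]

image = [[r * 10 + c for c in range(8)] for r in range(6)]   # 6x8 toy image
offset = crop_offsets((6, 8), (4, 5))
patch = crop_2d(image, offset, (4, 5))
print(offset, len(patch), len(patch[0]))
```

A dimension whose crop size equals its full size (such as the channel dimension) gets `limit = 1`, so its offset is always 0.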

Regularization loss in ‘slim’ library of Tensorflow

My Python code uses the slim library to train a classification model in Tensorflow:

    with tf.contrib.slim.arg_scope(mobilenet_v2.training_scope(weight_decay = 0.001)):
      logits, _ = mobilenet_v2.mobilenet(images, NUM_CLASSES)
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    cross_entropy = tf.reduce_mean(cross_entropy)
    global_step = tf.contrib.framework.get_or_create_global_step()
    train_op = tf.contrib.slim.learning.create_train_op(cross_entropy, opt, global_step = global_step)
...
    sess.run(train_op)

It works fine. However, no matter what value ‘weight_decay’ has, the training accuracy of the model easily climbs above 90%. It seems ‘weight_decay’ just doesn’t work.
To find out the reason, I reviewed the Tensorflow code for ‘tf.losses.sparse_softmax_cross_entropy()’:

# tensorflow/python/ops/losses/losses_impl.py
@tf_export("losses.sparse_softmax_cross_entropy")
def sparse_softmax_cross_entropy(
    labels, logits, weights=1.0, scope=None,
    loss_collection=ops.GraphKeys.LOSSES,
    reduction=Reduction.SUM_BY_NONZERO_WEIGHTS):
...
  with ops.name_scope(scope, "sparse_softmax_cross_entropy_loss",
                      (logits, labels, weights)) as scope:
    # As documented above in Args, labels contain class IDs and logits contains
    # 1 probability per class ID, so we expect rank(logits) - rank(labels) == 1;
    # therefore, expected_rank_diff=1.
    labels, logits, weights = _remove_squeezable_dimensions(
        labels, logits, weights, expected_rank_diff=1)
    losses = nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                         logits=logits,
                                                         name="xentropy")
    return compute_weighted_loss(
        losses, weights, scope, loss_collection, reduction=reduction)

‘losses.sparse_softmax_cross_entropy()’ simply calls ‘nn.sparse_softmax_cross_entropy_with_logits()’ and passes the result to ‘compute_weighted_loss()’. So let’s look into the implementation of ‘compute_weighted_loss()’:

# tensorflow/python/ops/losses/losses_impl.py
@tf_export("losses.compute_weighted_loss")
def compute_weighted_loss(
    losses, weights=1.0, scope=None, loss_collection=ops.GraphKeys.LOSSES,
    reduction=Reduction.SUM_BY_NONZERO_WEIGHTS):
...
      loss = math_ops.cast(loss, input_dtype)
      util.add_loss(loss, loss_collection)
      return loss

What is the secret in 'util.add_loss()'?

# tensorflow/python/ops/losses/util.py
@tf_export("losses.add_loss")
def add_loss(loss, loss_collection=ops.GraphKeys.LOSSES):
...
  if loss_collection:
    ops.add_to_collection(loss_collection, loss)

The losses of 'losses.sparse_softmax_cross_entropy()' are added to the 'GraphKeys.LOSSES' collection. Then where do the weight penalties of the parameters go? Are they added to the same collection? Let's check. All the layers built with 'tf.layers' or 'tf.contrib.slim' inherit from 'class Layer', and a layer calls 'add_loss()' when it calls 'add_variable()' with a regularizer. Let's check 'add_loss()' of the base class 'Layer':
@tf_export('layers.Layer')
class Layer(checkpointable.CheckpointableBase):
...
    def add_loss(self, losses, inputs=None):
        ...
        _add_elements_to_collection(losses, ops.GraphKeys.REGULARIZATION_LOSSES)

It's weird. The loss from the variables' weights is not added to 'GraphKeys.LOSSES' but to 'GraphKeys.REGULARIZATION_LOSSES'. Then how can we get all the losses at the training stage? After grepping for 'REGULARIZATION_LOSSES' in the whole Tensorflow code base, I came across 'get_total_loss()':
# tensorflow/python/ops/losses/util.py
@tf_export("losses.get_total_loss")
def get_total_loss(add_regularization_losses=True, name="total_loss"):
...
  losses = get_losses()
  if add_regularization_losses:
    losses += get_regularization_losses()
  return math_ops.add_n(losses, name=name)

That is the secret of losses in 'tf.layers' and 'tf.contrib.slim': we should use 'get_total_loss()' to fetch model loss and regularization loss together!

After changing my code:
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    cross_entropy = tf.reduce_mean(cross_entropy)
    global_step = tf.contrib.framework.get_or_create_global_step()
    loss = tf.contrib.slim.losses.get_total_loss()
    train_op = tf.contrib.slim.learning.create_train_op(loss, opt, global_step = global_step)
...
    sess.run(train_op)

Now 'weight_decay' works well (which means the training accuracy can no longer reach a high value so easily).
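
The two-collection mechanism traced above can be mimicked with a toy model in plain Python. This is only a sketch of the bookkeeping, not Tensorflow code: the `collections` dictionary and the helper functions here are stand-ins, and the loss values are made-up numbers.

```python
from collections import defaultdict

LOSSES = "losses"
REGULARIZATION_LOSSES = "regularization_losses"

collections = defaultdict(list)   # toy stand-in for the graph collections

def add_loss(value, collection=LOSSES):
    """Register a loss value in the named collection."""
    collections[collection].append(value)

def get_total_loss(add_regularization_losses=True):
    """Sum the data losses, optionally plus the regularization losses."""
    losses = list(collections[LOSSES])
    if add_regularization_losses:
        losses += collections[REGULARIZATION_LOSSES]
    return sum(losses)

add_loss(2.30)                           # made-up cross-entropy term
add_loss(0.15, REGULARIZATION_LOSSES)    # made-up weight-decay term, layer 1
add_loss(0.05, REGULARIZATION_LOSSES)    # made-up weight-decay term, layer 2

data_loss_only = get_total_loss(add_regularization_losses=False)
total_loss = get_total_loss()
print(data_loss_only, total_loss)
```

Training on `data_loss_only` corresponds to the original code, where the weight-decay terms exist in the graph but never reach the optimizer.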

Using multi-GPUs for training in distributed environment of Tensorflow

I am trying to write code for training on multiple GPUs. The code is mainly from the example of ‘Distributed Tensorflow’. I have changed the code slightly to run on GPUs:

...
tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d/GPU:%d" % (FLAGS.task_index, FLAGS.task_index),
        cluster=cluster)
...

But after launching the script below:

python model.py train 0.9 0.0001 0.53 ps 0 &> ps.log &
python model.py train 0.9 0.0001 0.53 worker 0 &> worker0.log &
python model.py train 0.9 0.0001 0.53 worker 1 &> worker1.log &
...

it reports:

Traceback (most recent call last):
  File "model.py", line 175, in <module>
    server = tf.train.Server(cluster, job_name = job_name, task_index = task_index)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11721506816

It seems that one MonitoredTrainingSession occupies the memory of all GPUs. After searching on Google, I finally found a solution: ‘CUDA_VISIBLE_DEVICES’.
First, change ‘replica_device_setter’:

...
tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d/GPU:0" % FLAGS.task_index,
        cluster=cluster)
...

and then use this shell script to launch training processes:

CUDA_VISIBLE_DEVICES=0 python model.py train 0.9 0.0001 0.53 ps 0 &> ps.log &
sleep 1
for i in `seq 0 2`; do
  dev=`expr ${i} + 1`
  CUDA_VISIBLE_DEVICES=${dev} stdbuf -o0 python model.py train 0.9 0.0001 0.53 worker ${i} &> worker_${i}.log &
  sleep 1
done

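The ‘ps’ will only use GPU0, ‘worker0’ will only use GPU1, ‘worker1’ will only use GPU2, and so on. Because each process sees only one device, that device is always named ‘GPU:0’ inside the process, which is why ‘worker_device’ now uses ‘GPU:0’.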
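
The same launch scheme can also be written as a small Python launcher. This is a sketch under the assumption that the script and its arguments are the ones shown above; `build_launch_plan` and `launch` are hypothetical helper names, and `launch` is defined but not called here.

```python
import os
import subprocess

def build_launch_plan(num_workers, script="model.py", extra_args=()):
    """Return (env_overrides, argv) pairs: ps on GPU 0, worker i on GPU i + 1."""
    plan = [({"CUDA_VISIBLE_DEVICES": "0"},
             ["python", script] + list(extra_args) + ["ps", "0"])]
    for i in range(num_workers):
        plan.append(({"CUDA_VISIBLE_DEVICES": str(i + 1)},
                     ["python", script] + list(extra_args) + ["worker", str(i)]))
    return plan

def launch(plan):
    """Start every process with its own restricted view of the GPUs."""
    procs = []
    for overrides, argv in plan:
        env = dict(os.environ, **overrides)   # copy the environment, then override
        procs.append(subprocess.Popen(argv, env=env))
    return procs

plan = build_launch_plan(3, extra_args=["train", "0.9", "0.0001", "0.53"])
for overrides, argv in plan:
    print(overrides["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

Setting the variable per child process, rather than exporting it globally, keeps the device mapping next to the command it applies to.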

Problems and solutions about building Tensorflow-1.8 with TensorRT 4.0

Problem:
When compiling Tensorflow-1.8 with CUDA-9.2, it reports:

bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Spython_Cgen_Unn_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cublasGemmEx@libcublas.so.9.0'
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Spython_Cgen_Unn_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cublasZhpmv_v2@libcublas.so.9.0'
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Spython_Cgen_Unn_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cufftExecD2Z@libcufft.so.9.0'
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Spython_Cgen_Unn_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cublasSrotg_v2@libcublas.so.9.0'
bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Spython_Cgen_Unn_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cufftExecR2C@libcufft.so.9.0'
...

Solution:
Add ‘/usr/local/cuda-9.2/lib64’ to ‘/etc/ld.so.conf’ and run ‘sudo ldconfig’ to make it work.

Problem:
When compiling Tensorflow-1.8, it reports:

./tensorflow/python/client/tf_session_helper.h:19:20: fatal error: Python.h: No such file or directory
...

Solution:
In ‘.tf_configure.bazelrc’ file, use real python location instead of soft link:

#don't use "/usr/bin/python"
build --action_env PYTHON_BIN_PATH="/usr/bin/python2.7"
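
Before editing the configuration, it can help to ask the interpreter itself where it really lives and whether its headers are installed. The following is a minimal sketch using only the standard library; `python_header_info` is a hypothetical helper name.

```python
import os
import sys
import sysconfig

def python_header_info():
    """Return (resolved interpreter path, include dir, whether Python.h exists)."""
    interpreter = os.path.realpath(sys.executable)    # follows symbolic links
    include_dir = sysconfig.get_paths()["include"]
    has_header = os.path.isfile(os.path.join(include_dir, "Python.h"))
    return interpreter, include_dir, has_header

interpreter, include_dir, has_header = python_header_info()
print("interpreter: %s" % interpreter)
print("include dir: %s" % include_dir)
print("Python.h present: %s" % has_header)
```

If `Python.h` is reported as missing, the development headers for that interpreter are probably not installed, which is a different cause from the soft-link issue above.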

Problem:
When running TensorRT, it reports:

ImportError: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/web_server/dlpy72/dlpy/lib/python2.7/site-packages/tensorrt/infer/_nv_infer_bindings.so)

Solution:
Run TensorRT with LD_LIBRARY_PATH:

LD_LIBRARY_PATH=/usr/local/gcc-5.3/lib64:$LD_LIBRARY_PATH python run_tensorrt.py

Testing performance of Tensorflow’s fixed-point-quantization on x86_64 cpu

Google has published their quantization method in this paper. It uses int8 for the feed-forward pass but float32 for back-propagation, since back-propagation needs more precision to accumulate gradients. I had a question right after reading the paper: why are all the performance tests run on mobile phones (ARM architecture)? The quantized model in Google’s method needs not only addition and multiplication of int8 numbers, but also bit-shift operations. The AVX instruction set of the Intel x86_64 architecture can accelerate MAC (multiply-accumulate) operations, but I suspected it could not boost the bit-shift operations.
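
To make the "multiply, add, then shift" pattern concrete, here is a minimal plain-Python sketch of integer-only rescaling as I understand this kind of scheme: int8 values are accumulated in a wide integer, and the real-valued scale is replaced by a fixed-point multiplier plus a right shift. The function names and the example scale 0.0072 are assumptions for illustration, not code from the paper or from Tensorflow.

```python
import math

def quantize_multiplier(real_multiplier):
    """Split a real multiplier in (0, 1) into a Q31 integer and a right shift."""
    assert 0.0 < real_multiplier < 1.0
    # real_multiplier == mantissa * 2**exponent, with mantissa in [0.5, 1)
    mantissa, exponent = math.frexp(real_multiplier)
    q31 = int(round(mantissa * (1 << 31)))
    if q31 == (1 << 31):          # rounding pushed the mantissa up to 1.0
        q31 //= 2
        exponent += 1
    return q31, -exponent

def requantize(acc, q31, shift):
    """Scale an integer accumulator using only an integer multiply and a shift."""
    total_shift = 31 + shift
    rounding = 1 << (total_shift - 1)      # round to nearest
    return (acc * q31 + rounding) >> total_shift

def int8_dot(xs, ws):
    """Multiply-accumulate over int8 values; the sum is kept in a wide integer."""
    return sum(x * w for x, w in zip(xs, ws))

q31, shift = quantize_multiplier(0.0072)   # assumed combined scale
acc = int8_dot([127, -128, 64, 5], [100, -3, 7, -90])
print(acc, requantize(acc, q31, shift))
```

The multiply-accumulate part maps well onto vector MAC instructions; the final rescale is the step that adds the extra multiply and shift per output value.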
To verify my suspicion, I wrote a ResNet-50 model (float32) to classify the CIFAR-100 dataset. After running a few epochs, I evaluated the inference speed by using my ‘eval.py’. The result is:

Time: 5.58819s

Then I followed these steps to add tf.contrib.quantize.create_training_graph() and tf.contrib.quantize.create_eval_graph() to my code. This time, the inference speed is:

Time: 6.23221s

A little bit disappointing. Using the quantized (int8) version of the model did not speed up inference on the x86 CPU. Maybe we need to find another, more powerful quantization algorithm.
Appendix:

# eval.py
from input_data import Cifar100Data
import tensorflow as tf
import numpy as np
import resnet_v2
import argparse
import time
import sys
EVAL_SAMPLES = 10000
BATCH_SIZE = 10000
MODEL_PATH = './models/'
MODEL_NAME = 'cifar_resnet_50'
def cnn_part(images):
    print(images.shape)
    ivg, _ = resnet_v2.resnet_v2_50(images, 100)
    return ivg
def main(_):
    with tf.device('/cpu:0'):
        images = tf.placeholder(tf.float32, [BATCH_SIZE, 32, 32, 3])
        labels = tf.placeholder(tf.int64, [BATCH_SIZE])
    with tf.contrib.slim.arg_scope([tf.contrib.slim.conv2d],
                        weights_initializer = tf.truncated_normal_initializer(mean = 0, stddev = 0.1)):
        image_vector = cnn_part(images)
    loss = tf.losses.sparse_softmax_cross_entropy(labels = labels, logits = image_vector)
    loss = tf.reduce_mean(loss)
    opt = tf.train.AdamOptimizer(1e-3)
    train_op = tf.contrib.slim.learning.create_train_op(loss, opt)
    correct_prediction = tf.equal(tf.argmax(image_vector, 1), labels)
    correct_prediction = tf.cast(correct_prediction, tf.float32)
    accuracy = tf.reduce_mean(correct_prediction)
    data = Cifar100Data('/disk3/cifar/cifar-100-python/test')
    saver = tf.train.Saver()
    with tf.Session() as sess:
        with tf.gfile.FastGFile('./models/cifar_resnet_50_quant.pb', 'rb') as fl:
            graph_def = tf.GraphDef()
            graph_def.ParseFromString(fl.read())
        tf.import_graph_def(graph_def, name = '')
        saver.restore(sess, MODEL_PATH + MODEL_NAME + '-' + str(FLAGS.epoch))
        batch = data.next_batch(BATCH_SIZE)
        for i in range(3):
            begin = time.time()
            res = sess.run(accuracy, feed_dict = {images: batch[0], labels: batch[1]})
            print("Time: %gs" % (time.time() - begin))
            print(res)
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--epoch', type=str,
                        default='8',
                        help='Epoch of checkpoint for evaluation')
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main = main, argv = [sys.argv[0]] + unparsed)

Hard training works in deep learning

This week, I was trying to train two deep-learning models. They differ from my previous training jobs: it is really hard to make them converge to a small ‘loss’.
The first model is for bird image classification. Previously we wrote a modified Resnet-50 model with MXNet and reached 78% evaluation accuracy with it. But after we rewrote the same model in Tensorflow, it could only reach 50% evaluation accuracy, which seemed very weird. My first thought was that it’s a regularization problem, so I randomly pad/crop and rotate the training images:

  image = tf.image.resize_image_with_crop_or_pad(image, IMAGE_HEIGHT + 80, IMAGE_WIDTH + 80)
  image = tf.contrib.image.rotate(image, tf.random_uniform([1], minval = -math.pi / 3.0, maxval = math.pi / 3.0))
  image = tf.random_crop(image, [IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS])

With data augmentation, the evaluation accuracy rose to about 60%, but that was still far from the MXNet result.
Then I changed the optimizer from AdamOptimizer to GradientDescentOptimizer, since my colleague told me that AdamOptimizer is so powerful that it tends to overfit. I also added ‘weight_decay’ to my Resnet-50 model. This time, the evaluation accuracy rose to 76%. The effect of ‘weight_decay’ is significantly positive.
The second model is for object detection. We just used the examples from Tensorflow’s model library, which includes many cutting-edge object detection models. I just wanted to try SSD (Single Shot MultiBox Detector) on MobileNetV2:

python object_detection/train.py \
  --logtostderr \
  --pipeline_config_path=/disk3/donghao/models/research/object_detection/samples/configs/ssd_mobilenet_v2_coco.config \
  --train_dir=/disk3/donghao/myckpt/ \
  --num_clones=2

The loss drops rapidly from hundreds to twelve, but then stays at eleven for a very long time. It looks like the loss will stay there forever. So I began to adjust hyper-parameters. After testing several learning rates and optimizers, the result didn’t change at all.
Eventually, I noticed that the loss doesn’t stay there forever: it WILL REDUCE AT LAST. For some tasks such as classification, the loss converges quickly. But for other tasks such as object detection, the loss shrinks extremely slowly. Using AdamOptimizer with a small learning rate is a better choice for this type of task.

To check abnormal loss value when training a new model

Yesterday I wrote a Tensorflow program to train a Resnet-50 model on the CIFAR100 dataset. But when the training began, I saw that the classification ‘loss’ was abnormally big and didn’t decrease at all:

loss[2.6032338e+25]
loss[2.5617402e+25]
loss[3.3851871e+25]
loss[3.092054e+25]
...

At first, I thought the code for processing the dataset might be wrong. But after printing the data to the console, the loaded input data seemed all right. Then I printed the values of all tensors right after model initialization, and these values seemed correct too.
With no other choice, I began to check the initializer in the Tensorflow code:

    with slim.arg_scope([slim.conv2d],
                        weights_initializer = tf.truncated_normal_initializer(mean = 0, stddev = 0.1)):
      img_inf, _ = resnet_v2.resnet_v2_50(image, NUM_CLASSES)

If the loss is too big, maybe I could decrease the initial values of the tensors in the model? So I changed ‘mean’ from ‘0’ to ‘0.1’ for ‘slim.conv2d’:

                        weights_initializer = tf.truncated_normal_initializer(mean = 0.001, stddev = 1)):

But the loss got even crazier:

loss[1.1468245e+29]
loss[1.6610325e+29]
loss[1.1840615e+29]
...

I had to change ‘mean’ and ‘stddev’ again:

                        weights_initializer = tf.truncated_normal_initializer(mean = 0.001, stddev = 1)):

This time, the loss looks correct:

loss[1215.8724]
loss[1023.67676]
loss[583.6274]
...

This is the first time I have seen initial values make the training behave so differently.
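
A rough back-of-the-envelope model shows why the initializer's ‘stddev’ can matter this much. The sketch below assumes independent zero-mean weights and inputs and a purely linear stack, and it ignores ReLU, batch normalization and skip connections, so it only gives a feel for the scale. The fan-in of 3x3x64 is an assumed example, and `layer_gain` and `stacked_gain` are hypothetical helper names.

```python
import math

def layer_gain(fan_in, stddev):
    """Std of one linear unit's output when its inputs have unit std."""
    return stddev * math.sqrt(fan_in)

def stacked_gain(depth, fan_in, stddev):
    """Gain after stacking `depth` such layers."""
    return layer_gain(fan_in, stddev) ** depth

fan_in = 3 * 3 * 64        # assumed: 3x3 kernel over 64 input channels
for stddev in (0.1, 0.01, math.sqrt(1.0 / fan_in)):
    print("stddev=%.4f  per-layer gain=%.3f  gain after 50 layers=%.3g"
          % (stddev, layer_gain(fan_in, stddev), stacked_gain(50, fan_in, stddev)))
```

Under these assumptions a per-layer gain only slightly above 1 already compounds into an enormous activation scale over 50 layers, while a gain below 1 shrinks the signal toward zero.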

An example for running operation before fetching data in Tensorflow

In Tensorflow, what should we do if we want to run something before fetching data (for example, when using a queue)? Here is an example I tested myself:

import tensorflow as tf
from tensorflow.python.framework import ops
def my_func(x):
    print("hello")
    return x
queue = tf.FIFOQueue(10, "float32")
init  = queue.enqueue_many(([1, 2, 3],))
my_op = tf.py_func(my_func, [1.0], tf.float32)
with ops.control_dependencies([my_op]):
    inc = queue.enqueue([1],)
with tf.Session() as sess:
    sess.run(init)
    print("Before Enqueue")
    sess.run(inc)
    print("After Enqueue")

It will print

Before Enqueue
hello
After Enqueue

We have successfully added an operation that runs before an item is enqueued into the queue.

Why my model doesn’t converge?

To run Resnet-50 on the CIFAR100 dataset, I wrote a program with Tensorflow. But when running it, the loss seemed to stay at about 4.5~4.6 forever:

step: 199, loss: 4.61291, accuracy: 0
step: 200, loss: 4.60952, accuracy: 0
step: 201, loss: 4.60763, accuracy: 0
step: 202, loss: 4.62495, accuracy: 0
step: 203, loss: 4.62312, accuracy: 0
step: 204, loss: 4.60703, accuracy: 0
step: 205, loss: 4.60947, accuracy: 0
step: 206, loss: 4.59816, accuracy: 0
step: 207, loss: 4.62643, accuracy: 0
step: 208, loss: 4.59422, accuracy: 0
...

After changing models (from Resnet to a fully-connected net), optimizers (from AdamOptimizer to AdagradOptimizer), and even the learning rate (from 1e-3 down to 1e-7), the phenomenon didn’t change at all.
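
The stuck value itself is a useful hint. A classifier that predicts the same probability for every class has a cross-entropy of ln(number of classes), so a loss pinned near that value suggests the model is learning nothing from its inputs. A minimal check, with `chance_level_cross_entropy` as a hypothetical helper name:

```python
import math

def chance_level_cross_entropy(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return -math.log(1.0 / num_classes)

print("%.4f" % chance_level_cross_entropy(100))
```

For the 100 classes of CIFAR100 this is about 4.605, which matches the loss values in the log above.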
Finally, I checked the loss and the output vector step by step, and found that the problem was not in the model but in the dataset code:

    def next_batch(self, batch_size = 64):
        images = []
        labels = []
        for i in range(self.pos, self.pos + batch_size):
            image = self.data['data'][self.pos]
            image = image.reshape(3, 32, 32)
            image = image.transpose(1, 2, 0)
            image = image.astype(np.float32) / 255.0
            images.append(image)
            label = self.data['fine_labels'][self.pos]
            labels.append(label)
        if (self.pos + batch_size) >= CIFAR100_TRAIN_SAMPLES:
            self.pos = 0
        else:
            self.pos = self.pos + batch_size
        return [images, labels]

Within every batch, all the pictures and labels are the same! That’s why the model didn’t converge. I should have used ‘i’ instead of ‘self.pos’ as the index to fetch data and labels.
So in deep learning, problems come not only from models and hyper-parameters, but also from the dataset or faulty code…
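
A corrected version of the batching loop might look like the sketch below. It uses plain lists instead of the real CIFAR100 arrays, and `ToyDataset` is a hypothetical stand-in for the dataset class in the post. Unlike the original, it also clamps the last batch at the end of the data, which is an extra choice of this sketch.

```python
class ToyDataset(object):
    """Hypothetical stand-in for the post's dataset class, using plain lists."""

    def __init__(self, samples, labels):
        self.samples = samples
        self.labels = labels
        self.pos = 0

    def next_batch(self, batch_size=4):
        images, labels = [], []
        end = min(self.pos + batch_size, len(self.samples))
        for i in range(self.pos, end):
            images.append(self.samples[i])     # index with i, not self.pos
            labels.append(self.labels[i])
        # Wrap around once the end of the data is reached.
        self.pos = 0 if end >= len(self.samples) else end
        return images, labels

data = ToyDataset(list(range(10)), [n % 3 for n in range(10)])
first = data.next_batch()
second = data.next_batch()
print(first[0], second[0])
```

A cheap guard against this kind of bug is to print or assert that a batch contains more than one distinct sample before starting a long training run.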

Problem about using slim.batch_norm() of Tensorflow (second episode)

In the previous article, I found out the reason. But how to resolve it in multi-GPU training was still an open question. Following the suggestion in this issue on GitHub, I tried two ways to fix the problem:
First, I rewrote my Averaging-Gradients-Training to imitate tf.contrib.slim.learning.create_train_op():

...
def create_train_grads(total_loss, optimizer):
  update_ops = set(ops.get_collection(ops.GraphKeys.UPDATE_OPS))
  with ops.control_dependencies(update_ops):
    barrier = control_flow_ops.no_op(name='update_barrier')
  total_loss = control_flow_ops.with_dependencies([barrier], total_loss)
  variables_to_train = tf_variables.trainable_variables()
  grads = optimizer.compute_gradients(total_loss, variables_to_train)
  return grads
...
          cross_entropy = tf.reduce_mean(cross_entropy)
          tf.get_variable_scope().reuse_variables()
          grads = create_train_grads(cross_entropy, opt)
          tower_grads.append(grads)
...
  grads = average_gradients(tower_grads)
  grad_updates = opt.apply_gradients(grads)
  with ops.name_scope('train_op'):
    # Ensure the train_tensor computes grad_updates.
    train_op = control_flow_ops.with_dependencies([grad_updates], cross_entropy)
  # Add the operation used for training to the 'train_op' collection
  train_ops = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
  if train_op not in train_ops:
    train_ops.append(train_op)

But unfortunately, this didn’t work at all. The inference result was still a mess.
Then, as the second way, I used Asynchronous-Gradient-Training and tf.contrib.slim.learning.create_train_op():

...
          cross_entropy = tf.reduce_mean(cross_entropy)
          train_op = tf.contrib.slim.learning.create_train_op(cross_entropy, opt)
          tower_ops.append(train_op)
...
  train_step = tf.group(*tower_ops)

Now the inference works very well! And the training speed becomes a little bit faster than Averaging-Gradients-Training, because the averaging operation needs to wait for the gradients from all GPUs.
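
The `average_gradients()` helper used in the first attempt is not shown in the post. As a plain-Python sketch of what such a helper typically computes (floats instead of tensors, variable names as strings, all values made up):

```python
def average_gradients(tower_grads):
    """Average per-variable gradients across towers.

    tower_grads: one list per tower of (gradient, variable_name) pairs,
    with the variables listed in the same order on every tower.
    """
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        name = grads_and_vars[0][1]
        averaged.append((sum(grads) / float(len(grads)), name))
    return averaged

tower_grads = [
    [(0.2, "conv1/weights"), (-1.0, "fc/weights")],   # made-up gradients, GPU 0
    [(0.4, "conv1/weights"), (-3.0, "fc/weights")],   # made-up gradients, GPU 1
]
print(average_gradients(tower_grads))
```

The `zip(*tower_grads)` step needs one gradient from every tower before it can produce an average, which is the synchronization point that the asynchronous variant avoids.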