Tag Archives: tensorflow

Problems and solutions when building Tensorflow-1.8 with TensorRT 4.0

Problem: When compiling Tensorflow-1.8 with CUDA-9.2, it reports:

Solution: Add ‘/usr/local/cuda-9.2/lib64’ to ‘/etc/ld.so.conf’ and run ‘sudo ldconfig’ to make it work.

Problem: When compiling Tensorflow-1.8, it reports:

Solution: In the ‘.tf_configure.bazelrc’ file, use the real Python binary path instead of the soft link:
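The original snippet is not shown in this archive view; for illustration only (the paths here are assumptions), the relevant line generated by ./configure looks like this, with the symlinked interpreter replaced by the concrete one:

    # Example only (paths are assumptions): point PYTHON_BIN_PATH at the
    # concrete interpreter instead of the /usr/bin/python symlink.
    # build --action_env PYTHON_BIN_PATH="/usr/bin/python"        # symlink
    build --action_env PYTHON_BIN_PATH="/usr/bin/python2.7"       # real binary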

Problem: When running TensorRT, it reports:

Solution:… Read more »

Testing the performance of Tensorflow’s fixed-point quantization on an x86_64 CPU

Google has published their quantization method in this paper. It uses int8 for the feed-forward pass but float32 for back-propagation, since back-propagation needs more precision to accumulate gradients. One question came to me right after reading the paper: why were all the performance tests done on mobile-phone platforms (ARM architecture)? The… Read more »
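To make the scheme concrete, here is a small sketch of the “fake quantization” idea (my own illustration, not the paper’s code): the forward pass simulates 8-bit weights while the gradient still flows in float32.

    import tensorflow as tf

    # Sketch of quantization-aware training: simulate int8 weights in the
    # forward pass, keep float32 gradients for the backward pass.
    x = tf.random_normal([4, 8])
    w = tf.Variable(tf.random_normal([8, 8]))

    # Fake-quantize the weights to 8 bits inside an assumed [-1, 1] range.
    w_q = tf.fake_quant_with_min_max_args(w, min=-1.0, max=1.0, num_bits=8)
    y = tf.matmul(x, w_q)
    loss = tf.reduce_mean(tf.square(y))

    # The gradient w.r.t. w is float32 and passes through the fake-quant op
    # (a straight-through estimator), so tiny updates can still accumulate.
    grad_w = tf.gradients(loss, w)[0]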

Checking an abnormal loss value when training a new model

Yesterday I wrote a Tensorflow program to train the CIFAR100 dataset with a Resnet-50 model. But when the training began, the classification ‘loss’ was abnormally big and didn’t decrease at all:

At first, I thought the code for processing the dataset might be wrong. But after printing out the data… Read more »
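A quick sanity check that helps in this situation (my own sketch, with stand-in data; in practice images and labels would be one batch fetched from the input pipeline):

    import numpy as np

    # Stand-in batch; in practice these would come from the input pipeline.
    images = np.random.rand(128, 32, 32, 3).astype(np.float32)
    labels = np.random.randint(0, 100, size=128)

    num_classes = 100
    # A freshly initialized classifier should start near -ln(1/num_classes);
    # for CIFAR100 that is ln(100) ~= 4.6.  A much larger initial loss usually
    # points at the loss wiring or the labels, not at the optimizer.
    print("expected initial loss ~", np.log(num_classes))
    print("label range:", labels.min(), labels.max())   # should stay within [0, 99]
    print("image stats:", images.min(), images.max(), images.mean())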

An example of running an operation before fetching data in Tensorflow

In Tensorflow, what should we do if we want to run something before fetching data (for example, when using a queue)? Here is an example I tested myself:

It will print

We have successfully added an operation that runs before an item is enqueued into the queue.
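The original code is not shown in this archive view; a minimal sketch of the same idea (my own reconstruction, with hypothetical names) is:

    import tensorflow as tf

    # Run an update (here: incrementing a counter) before every enqueue by
    # making the enqueue depend on it via tf.control_dependencies.
    counter = tf.Variable(0, dtype=tf.int32)
    queue = tf.FIFOQueue(capacity=10, dtypes=[tf.int32])

    increment = tf.assign_add(counter, 1)             # the operation to run first
    with tf.control_dependencies([increment]):
        enqueue = queue.enqueue(counter.read_value())

    dequeue = queue.dequeue()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(3):
            sess.run(enqueue)
        for _ in range(3):
            print(sess.run(dequeue))                  # counter was bumped before each enqueue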

Why doesn’t my model converge?

To run the CIFAR100 dataset with Resnet-50, I wrote a program using Tensorflow. But when running it, the loss seemed to stay at about 4.5~4.6 forever:

After changing models (from Resnet to a fully-connected net), optimizers (from AdamOptimizer to AdagradOptimizer), and even the learning rate (from 1e-3 down to 1e-7), the phenomenon… Read more »
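One observation worth adding (mine, not from the post): the plateau value itself is informative.

    import math

    # A softmax over 100 classes that stays uniform gives a cross-entropy of
    # -ln(1/100) = ln(100) ~= 4.605 -- exactly the observed 4.5~4.6 plateau,
    # i.e. the network is predicting no better than chance.
    print(math.log(100))   # 4.605...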

A problem with using slim.batch_norm() in Tensorflow (second episode)

In the previous article, I found out the reason. But how to resolve it for multi-GPU training was still a question. Following the suggestion of this issue on GitHub, I tried two ways to fix the problem. First, rewrite my averaging-gradients training to follow tf.slim.create_train_op():

But unfortunately, this didn’t work at… Read more »
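For context (this is the standard recipe, not the post’s exact code): slim’s create_train_op() makes the train step depend on the batch-norm moving-average updates collected in tf.GraphKeys.UPDATE_OPS, something a hand-written averaging-gradients loop does not do by itself. A minimal sketch, with a plain compute_gradients() call standing in for the tower-averaged gradients:

    import tensorflow as tf

    slim = tf.contrib.slim

    # Toy graph containing a batch-norm layer; its moving-average updates are
    # registered in tf.GraphKeys.UPDATE_OPS, not in the gradient path.
    x = tf.random_normal([32, 8])
    net = slim.batch_norm(x, is_training=True)
    loss = tf.reduce_mean(tf.square(net))

    optimizer = tf.train.GradientDescentOptimizer(0.1)
    grads_and_vars = optimizer.compute_gradients(loss)   # stand-in for tower-averaged gradients

    # Make the train step depend on the moving_mean / moving_variance updates,
    # which is essentially what slim's create_train_op() does internally.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = optimizer.apply_gradients(grads_and_vars)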

A problem with using slim.batch_norm() in Tensorflow

After using resnet_v2_50 from tensorflow/models, I found that the inference results were totally incorrect, even though the training accuracy looked very good. At first, I suspected the preprocessing of the input samples:

Indeed, I had padded the image to too big a size. But after I changed the padding size to ’10’, the inference… Read more »
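For illustration (sizes here are my assumptions, apart from the ‘10’, which I read as pixels per side): the kind of pad-then-crop preprocessing being discussed looks like this, and padding far beyond the original 32x32 dilutes the actual image content.

    import tensorflow as tf

    # Pad the 32x32 CIFAR image before a random crop back to 32x32.
    image = tf.placeholder(tf.float32, [32, 32, 3])
    pad = 10   # as in the post; smaller pads (e.g. 4) are the more common choice for CIFAR
    padded = tf.image.resize_image_with_crop_or_pad(image, 32 + 2 * pad, 32 + 2 * pad)
    cropped = tf.random_crop(padded, [32, 32, 3])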

An experiment with distributed Tensorflow

Here is my experimental code for distributed Tensorflow, which I learned from the example.

The important thing is that we need to use tf.assign() to push the Variable back to the parameter server. The operation ‘tf.add’ is meant to run on task 0 of the worker in this example. But if we… Read more »
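A minimal sketch of the placement being described (my own, with hypothetical host:port values): the Variable lives on the parameter server, tf.add runs on worker task 0, and tf.assign() writes the result back to the PS-side Variable.

    import tensorflow as tf

    # Hypothetical two-process cluster; each process would also create a
    # tf.train.Server(cluster, job_name=..., task_index=...) and a session
    # pointed at the worker, which is omitted here.
    cluster = tf.train.ClusterSpec({
        "ps":     ["localhost:2222"],
        "worker": ["localhost:2223"],
    })

    with tf.device("/job:ps/task:0"):
        var = tf.Variable(0.0)                 # lives on the parameter server

    with tf.device("/job:worker/task:0"):
        result = tf.add(var, 1.0)              # computed on worker task 0
        push_back = tf.assign(var, result)     # pushes the new value back to the PS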

The CSE (Common Subexpression Elimination) problem when running a custom operation in Tensorflow

Recently, we created a new custom operation in Tensorflow:

It’s as simple as the example in Tensorflow’s documentation. But when we run this Op in a session:

It only gets image_ids from the network once, and then uses the result of the first ‘run’ forever, without even calling the ‘Compute()’ function in… Read more »
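For background (my summary, not the post’s conclusion): TensorFlow treats a stateless op as a pure function of its inputs, so identical nodes may be merged and, with constant inputs, folded away at graph-optimization time, in which case Compute() is never called per step. A custom C++ op can opt out of this by adding .SetIsStateful() to its REGISTER_OP call; the Python-level analog is the stateful flag of tf.py_func, sketched below with hypothetical names.

    import numpy as np
    import tensorflow as tf

    # Hypothetical stand-in for an op that fetches fresh ids over the network.
    counter = {"n": 0}

    def fetch_image_ids(_):
        counter["n"] += 1
        return np.int64(counter["n"])

    x = tf.constant(0)
    # stateful=False would declare the op a pure function of its inputs, which
    # permits optimizations such as CSE; stateful=True tells TensorFlow the op
    # has side effects and must actually be executed each time.
    image_ids = tf.py_func(fetch_image_ids, [x], tf.int64, stateful=True)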