Sometimes I get this error from TPUEstimator:

...
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run                                                 
    run_metadata_ptr)                                                                                                                                 
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run                                               
    feed_dict_tensor, options, run_metadata)                                                                                                          
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run                                           
    run_metadata)                                                                                                                                     
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call                                           
    raise type(e)(node_def, op, message)                                                                                                  
tensorflow.python.framework.errors_impl.DeadlineExceededError: Deadline Exceeded    

After I stopped and restarted the TPU in the GCP console, the error disappeared. Unlike a GPU, a TPU can’t be used directly: there is no device such as ‘/dev/tpu’ visible in the VM. Google exposes the TPU as an RPC service, so DNN training can only run through that service. I suspect this RPC service is not stable enough yet, so it sometimes stops responding and the result is the ‘Deadline Exceeded’ error.
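
Since the TPU is reached over the network rather than as a local device, a cheap first check when ‘Deadline Exceeded’ shows up is whether the TPU’s gRPC endpoint accepts connections at all. Here is a minimal sketch, assuming the TPU listens on the usual gRPC port 8470. The helper name tpu_endpoint_reachable and the address in the usage comment are my own placeholders, not part of TensorFlow. A successful TCP connect does not prove the TPU is healthy; it only rules out the simplest failure.

```python
import socket


def tpu_endpoint_reachable(host, port=8470, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This is only a network-level probe (hypothetical helper): it does not
    speak gRPC and cannot tell whether the TPU runtime itself is stuck.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, or timed out: treat all as "not reachable".
        return False
    conn.close()
    return True


# Usage (10.240.1.2 is a made-up address; use your TPU's internal IP):
#   if not tpu_endpoint_reachable("10.240.1.2"):
#       print("TPU endpoint not reachable, consider restarting the TPU")
```

If the probe succeeds but training still fails with ‘Deadline Exceeded’, restarting the TPU from the console is still the fix that worked for me.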

When I get this type of error from the TPU:

2018-09-29 01:57:12.779430: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.

The only solution is to create a new TPU instance and delete the old one in the GCP console. It seems Google needs to improve the robustness of its TPU RPC service.

I trained for 10000 steps at a time and got this ‘loss’ at the end of each turn:

INFO:tensorflow:Loss for final step: 3.2015076.
INFO:tensorflow:Loss for final step: 2.5733204.
INFO:tensorflow:Loss for final step: 1.8888541.
INFO:tensorflow:Loss for final step: 2.3713436.
INFO:tensorflow:Loss for final step: 2.9957836.
INFO:tensorflow:Loss for final step: 1.3974692.
INFO:tensorflow:Loss for final step: 1.3933656.
INFO:tensorflow:Loss for final step: 2.3544135.
INFO:tensorflow:Loss for final step: 1.9383199.
INFO:tensorflow:Loss for final step: 2.0213509.
INFO:tensorflow:Loss for final step: 1.8641331.
INFO:tensorflow:Loss for final step: 1.6767861.
INFO:tensorflow:Loss for final step: 2.63849.
INFO:tensorflow:Loss for final step: 2.19468.
INFO:tensorflow:Loss for final step: 1.9854712.
INFO:tensorflow:Loss for final step: 1.9380764.
INFO:tensorflow:Loss for final step: 0.97299415.
INFO:tensorflow:Loss for final step: 2.089243.
INFO:tensorflow:Loss for final step: 2.1150723.
INFO:tensorflow:Loss for final step: 1.8242038.
INFO:tensorflow:Loss for final step: 2.8426473.

It’s quite strange that the ‘loss’ doesn’t go any lower. I still need to run more experiments.
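
Rather than eyeballing the list, one way to check whether the loss is trending down at all is to parse the ‘Loss for final step’ lines and compare the average of the earlier turns with the later ones. This is a rough sketch, not a proper convergence test; the function names are my own, and the sample lines are copied from the log above.

```python
import re

# Matches the number in lines like
# "INFO:tensorflow:Loss for final step: 3.2015076."
# The trailing sentence period is left outside the captured group.
LOSS_RE = re.compile(r"Loss for final step: (\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)")


def parse_final_losses(log_lines):
    """Extract the final-step loss values from TensorFlow log lines."""
    losses = []
    for line in log_lines:
        match = LOSS_RE.search(line)
        if match:
            losses.append(float(match.group(1)))
    return losses


def mean(values):
    return sum(values) / float(len(values))


def first_vs_second_half(losses):
    """Return (mean of first half, mean of second half) of the losses."""
    mid = len(losses) // 2
    return mean(losses[:mid]), mean(losses[mid:])


sample = [
    "INFO:tensorflow:Loss for final step: 3.2015076.",
    "INFO:tensorflow:Loss for final step: 2.5733204.",
    "INFO:tensorflow:Loss for final step: 1.8888541.",
    "INFO:tensorflow:Loss for final step: 2.3713436.",
]
early, late = first_vs_second_half(parse_final_losses(sample))
print("early mean: %.3f, late mean: %.3f" % (early, late))
```

If the two averages stay close over many turns, the loss has probably plateaued rather than just being noisy.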

Previously, I ran MobileNet_v2 on a machine with a GeForce GTX 960, and it could process 100 samples per second. Using 8 TPUs of version 2, it processes about 500 samples per second. At first I was disappointed by the speedup from TPUv2, since this works out to only about 1.4 TFLOPS each. But then I noticed that the bottleneck may not be the TPU itself, since IO is usually what limits training speed. Besides, my model is MobileNet_v2, which is too simple and light to use the full capability of the TPU.
Therefore I set ‘depth_multiplier=4’ for MobileNet_v2. With this model, the GTX 960 could process 21 samples per second, and the TPUv2-8 could process 275 samples per second. This time, I estimate that each TPUv2 delivers about 4 TFLOPS. I know this is far below Google’s official 45 TFLOPS. But considering the possible bottlenecks of storage IO and network bandwidth, it becomes understandable. There is also another possibility: Google’s 45 TFLOPS may refer to half-precision performance 🙂
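
The per-TPU figures above are back-of-the-envelope numbers. One reading that reproduces them is to scale the GTX 960’s peak throughput by the ratio of samples per second. The sketch below assumes a GTX 960 peak of roughly 2.3 TFLOPS (FP32), a value that is not stated above, and it also assumes the GPU runs at its peak, which is unlikely in practice. Treat the results as rough comparisons only.

```python
def estimate_tflops_per_tpu(ref_tflops, ref_rate, tpu_rate, num_tpus):
    """Scale a reference device's TFLOPS by relative samples per second.

    ref_tflops: assumed throughput of the reference device (the GTX 960)
    ref_rate:   samples/sec measured on the reference device
    tpu_rate:   samples/sec measured on all TPUs together
    num_tpus:   number of TPUs sharing the work
    """
    per_tpu_rate = tpu_rate / float(num_tpus)
    return ref_tflops * per_tpu_rate / ref_rate


GTX_960_TFLOPS = 2.3  # assumption: approximate FP32 peak, not a measurement

# Plain MobileNet_v2: 100 samples/sec on the GPU, 500 on 8 TPUs.
print(estimate_tflops_per_tpu(GTX_960_TFLOPS, 100, 500, 8))  # about 1.4

# depth_multiplier=4: 21 samples/sec on the GPU, 275 on 8 TPUs.
print(estimate_tflops_per_tpu(GTX_960_TFLOPS, 21, 275, 8))  # about 3.8
```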

Google has just released TensorFlow 1.11 for TPU clusters. At first I thought I could now use hooks in TPUEstimatorSpec, but after adding

import tensorflow as tf

def model_fn(features, labels, mode, params):
    ...
    # Log the loss every 100 iterations.
    logging_hook = tf.train.LoggingTensorHook({'loss': loss}, every_n_iter=100)
    return tf.contrib.tpu.TPUEstimatorSpec(mode, loss=loss, training_hooks=[logging_hook], train_op=train_op)

it reports

INFO:tensorflow:Error recorded from training_loop: Operation u'total_loss' has been marked as not fetchable.

Certainly, the TPU is much harder to use and debug than GPU/CPU.