Using multiple GPUs for training in a distributed TensorFlow environment

I am trying to write code for training on multiple GPUs. The code is mainly based on the 'Distributed TensorFlow' example, which I have changed slightly to run on GPUs:
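Roughly, my script looks like the sketch below. The flag names, cluster addresses, toy regression model, checkpoint directory, and file name are placeholders for illustration rather than my exact code; the "slight change" is pinning each worker's ops to a GPU through `replica_device_setter`.

```python
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string("job_name", "worker", "Either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of the task within the job")


def main(_):
    # One parameter server and two workers, all on the same machine (placeholder ports).
    cluster = tf.train.ClusterSpec({
        "ps": ["localhost:2222"],
        "worker": ["localhost:2223", "localhost:2224"],
    })
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        server.join()
        return

    # Place variables on the ps job and ops on this worker's GPU.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d/gpu:%d"
                          % (FLAGS.task_index, FLAGS.task_index),
            cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 1])
        y = tf.placeholder(tf.float32, [None, 1])
        w = tf.Variable(tf.zeros([1, 1]))
        b = tf.Variable(tf.zeros([1]))
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    hooks = [tf.train.StopAtStepHook(last_step=10000)]
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(FLAGS.task_index == 0),
            checkpoint_dir="/tmp/train_logs",
            hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op, feed_dict={x: [[1.0]], y: [[2.0]]})


if __name__ == "__main__":
    tf.app.run()
```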

But after launching the training with the script below:
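A sketch of that launch script, assuming the code above is saved as trainer.py (a placeholder name) and the ports match the ClusterSpec:

```bash
python trainer.py --job_name=ps --task_index=0 &
python trainer.py --job_name=worker --task_index=0 &
python trainer.py --job_name=worker --task_index=1 &
wait
```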

the processes report errors about running out of GPU memory.

It seems that a single MonitoredTrainingSession occupies the memory of all the GPUs. After searching on Google, I finally found a solution: 'CUDA_VISIBLE_DEVICES'.
First, change the 'replica_device_setter':
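Since CUDA_VISIBLE_DEVICES hides every GPU except the assigned one, the single visible GPU always appears as device 0 inside each process. So the only change to the sketch above is the worker_device string, roughly:

```python
# Each process now sees exactly one GPU, which TensorFlow numbers as gpu:0.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d/gpu:0" % FLAGS.task_index,
        cluster=cluster)):
    # ... build the model exactly as before ...
    pass
```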

and then use a shell script like this to launch the training processes:
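Something like the following, again assuming the script is named trainer.py; each process gets a different value of CUDA_VISIBLE_DEVICES so that it can only see, and allocate memory on, its own GPU:

```bash
CUDA_VISIBLE_DEVICES=0 python trainer.py --job_name=ps --task_index=0 &
CUDA_VISIBLE_DEVICES=1 python trainer.py --job_name=worker --task_index=0 &
CUDA_VISIBLE_DEVICES=2 python trainer.py --job_name=worker --task_index=1 &
wait
```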

Now the 'ps' process only uses GPU0, 'worker0' only uses GPU1, 'worker1' only uses GPU2, and so on.
