I am trying to write code for training on multiple GPUs. The code is mostly taken from the Distributed TensorFlow example; I changed it slightly to run on GPUs:
...
tf.train.replica_device_setter(
worker_device="/job:worker/task:%d/GPU:%d" % (FLAGS.task_index, FLAGS.task_index),
cluster=cluster)
...
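For context, this setter sits in the usual Distributed TensorFlow skeleton roughly as sketched below. This is only a sketch: the host:port addresses and build_model() are placeholders, and FLAGS stands for however model.py actually parses its arguments.

import tensorflow as tf

# Hypothetical cluster layout; use the real host:port pairs of your setup.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()
else:
    # Ops built under this device scope land on the worker's GPU,
    # while variables are placed on the ps job by the setter.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d/GPU:%d" % (FLAGS.task_index, FLAGS.task_index),
            cluster=cluster)):
        train_op = build_model()  # placeholder for the actual model code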
But after launching the scripts below:
python model.py train 0.9 0.0001 0.53 ps 0 &> ps.log &
python model.py train 0.9 0.0001 0.53 worker 0 &> worker0.log &
python model.py train 0.9 0.0001 0.53 worker 1 &> worker1.log &
...
it reports:
Traceback (most recent call last):
  File "model.py", line 175, in
    server = tf.train.Server(cluster, job_name = job_name, task_index = task_index)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11721506816
It seems each MonitoredTrainingSession (in fact each TensorFlow process, since the error is already raised when tf.train.Server is created) tries to occupy the memory of all GPUs. After searching on Google, I finally found a solution: CUDA_VISIBLE_DEVICES.
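For completeness, the same per-process restriction can also be expressed inside TensorFlow itself, by handing a ConfigProto to the server (and to the session). This is only a sketch of an alternative, assuming TF 1.x and the same GPU mapping used below:

import tensorflow as tf

# Let this process see only one CUDA device and grow memory on demand
# instead of pre-allocating the whole card.
gpu_options = tf.GPUOptions(visible_device_list=str(task_index + 1),  # mirrors the mapping below: worker0 -> GPU1, ...
                            allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index, config=config)

# The same config can be passed on to the training session:
# tf.train.MonitoredTrainingSession(master=server.target, config=config, ...)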
Firstly, change replica_device_setter:
...
tf.train.replica_device_setter(
worker_device="/job:worker/task:%d/GPU:0" % FLAGS.task_index,
cluster=cluster)
...
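The device string becomes /GPU:0 because, once CUDA_VISIBLE_DEVICES narrows a process down to a single card, that card is renumbered to ordinal 0 inside the process. A quick throwaway check (not part of model.py) makes this visible:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"   # pretend this process is worker1

import tensorflow as tf
from tensorflow.python.client import device_lib

# Prints a single GPU named ".../GPU:0" even though it is physically
# the third card in the machine.
print([d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"])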
Then use this shell script to launch the training processes:
CUDA_VISIBLE_DEVICES=0 python model.py train 0.9 0.0001 0.53 ps 0 &> ps.log &
sleep 1
for i in `seq 0 2`; do
    dev=`expr ${i} + 1`
    CUDA_VISIBLE_DEVICES=${dev} stdbuf -o0 python model.py train 0.9 0.0001 0.53 worker ${i} &> worker_${i}.log &
    sleep 1
done
The ps process will use only GPU0, worker0 only GPU1, worker1 only GPU2, and so on.
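If you prefer to keep the plain python model.py ... launch commands, the same mapping can also be set inside model.py itself, as long as the variable is set before TensorFlow initializes the GPUs. A sketch, with the sys.argv positions assumed from the commands shown earlier:

import os
import sys

# ps -> GPU0, worker i -> GPU(i+1), mirroring the shell script above.
# The argv positions are an assumption based on the launch commands shown earlier.
job_name = sys.argv[5]
task_index = int(sys.argv[6])
os.environ["CUDA_VISIBLE_DEVICES"] = "0" if job_name == "ps" else str(task_index + 1)

import tensorflow as tf  # import (or at least create the Server) only after setting the variable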