In the previous article, I found out the reason. But how to resolve it in multi-GPU training was still a question. Following the suggestion in this GitHub issue, I tried two ways to fix the problem:
First, I rewrote my Averaging-Gradients training, modeled on tf.slim's create_train_op():
```python
import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import variables as tf_variables

...
def create_train_grads(total_loss, optimizer):
    # Make the loss depend on the UPDATE_OPS collection (e.g. batch-norm
    # moving-average updates) before computing gradients, the same way
    # slim's create_train_op does.
    update_ops = set(ops.get_collection(ops.GraphKeys.UPDATE_OPS))
    with ops.control_dependencies(update_ops):
        barrier = control_flow_ops.no_op(name='update_barrier')
    total_loss = control_flow_ops.with_dependencies([barrier], total_loss)
    variables_to_train = tf_variables.trainable_variables()
    grads = optimizer.compute_gradients(total_loss, variables_to_train)
    return grads
...
cross_entropy = tf.reduce_mean(cross_entropy)
tf.get_variable_scope().reuse_variables()
grads = create_train_grads(cross_entropy, opt)
tower_grads.append(grads)
...
grads = average_gradients(tower_grads)
grad_updates = opt.apply_gradients(grads)

with ops.name_scope('train_op'):
    # Ensure the train tensor computes grad_updates.
    train_op = control_flow_ops.with_dependencies([grad_updates], cross_entropy)

# Add the operation used for training to the 'train_op' collection.
train_ops = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
if train_op not in train_ops:
    train_ops.append(train_op)
```
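For completeness, `average_gradients` above is the usual tower-averaging helper from the TensorFlow CIFAR-10 multi-GPU example. Here is a minimal sketch, assuming each entry of `tower_grads` is a list of `(gradient, variable)` pairs in the same variable order:

```python
import tensorflow as tf

def average_gradients(tower_grads):
    # tower_grads: one list of (gradient, variable) pairs per GPU tower.
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars: ((grad_gpu0, var), (grad_gpu1, var), ...) for one variable.
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
        # The variable is shared across towers, so take it from the first tower.
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads
```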
But unfortunately, this didn’t work at all. The inference result was still a mess.
Then I tried the other way: Asynchronous-Gradients training with tf.slim's create_train_op():
```python
...
cross_entropy = tf.reduce_mean(cross_entropy)
train_op = tf.contrib.slim.learning.create_train_op(cross_entropy, opt)
tower_ops.append(train_op)
...
train_step = tf.group(*tower_ops)
```
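To show where this snippet sits, here is a hedged sketch of the surrounding per-tower loop; `num_gpus`, `build_model`, `images`, and `labels` are placeholder names, not from the original code:

```python
import tensorflow as tf
slim = tf.contrib.slim

opt = tf.train.AdamOptimizer(1e-4)
tower_ops = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
            logits = build_model(images[i])  # placeholder model function
            cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=labels[i], logits=logits)
            cross_entropy = tf.reduce_mean(cross_entropy)
            # create_train_op wires the UPDATE_OPS collection (e.g. batch-norm
            # moving-average updates) into this tower's train op.
            train_op = slim.learning.create_train_op(cross_entropy, opt)
            tower_ops.append(train_op)

# One step runs every tower's train op; each tower applies its own gradients,
# so the updates are asynchronous rather than averaged.
train_step = tf.group(*tower_ops)
```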
Now inference works very well! Training is also a little faster than Averaging-Gradients training, since the averaging operation has to wait for the gradients from all of the GPUs.