To let DeepSpeed survive the failure of one training node, we can launch with elastic training enabled:

deepspeed \
  --master_addr=rogpt1 \
  --elastic_training \
  --min_elastic_nodes=1 \
  --max_elastic_nodes=2 \
  --hostfile=hostfile \
  train.py \
  --deepspeed_config ds_config.json
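
The hostfile passed via --hostfile lists one node per line in DeepSpeed's "hostname slots=N" format, where slots is the number of GPUs on that node. A minimal sketch, assuming two nodes named rogpt1 and rogpt2 with 8 GPUs each (the second hostname and the slot counts are placeholders, not from the original setup):

rogpt1 slots=8
rogpt2 slots=8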

But if one training node fails and we later relaunch it, the relaunch fails because that node does not have the checkpoint in its local directory. There are two ways to solve this:

  1. Use a shared file system for the cluster (Filestore on GCP, EFS on AWS, or plain NFS) and let only the master node save the checkpoint. All other nodes can then read the saved checkpoint through the shared file system.
  2. Or set “use_node_local_storage” to true in the checkpoint section of the DeepSpeed config, so that every node saves its checkpoint state to its own local storage, as in the ds_config.json and training-loop sketch below.

{
   "steps_per_print": 2000,
   "checkpoint": {
     "use_node_local_storage": true
   },
   "elasticity": {
     "enabled": true,
     "micro_batch_sizes": [64,128,256],
     "max_train_batch_size": 1024
   },
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.001,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },
   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 0.001,
       "warmup_num_steps": 1000
     }
   },
   "wall_clock_breakdown": false
}
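
On the training-script side, checkpointing goes through the usual DeepSpeed engine calls; with “use_node_local_storage” enabled, every rank writes its checkpoint state to its own storage, so a relaunched node can resume from what it finds locally. The following is only a minimal sketch of what train.py could look like; the model, dataset, checkpoint directory, and save interval are placeholders, not taken from the original script:

import argparse
import deepspeed
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)   # set by the DeepSpeed launcher
parser = deepspeed.add_config_arguments(parser)              # adds --deepspeed_config etc.
args = parser.parse_args()

model = nn.Linear(1024, 1024)                                # placeholder model
dataset = TensorDataset(torch.randn(4096, 1024),             # placeholder data
                        torch.randn(4096, 1024))

# deepspeed.initialize reads ds_config.json via the --deepspeed_config flag
model_engine, optimizer, loader, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
)

ckpt_dir = "checkpoints"   # node-local path (solution 2) or a shared mount (solution 1)

# try to resume from the latest checkpoint under ckpt_dir, if any
model_engine.load_checkpoint(ckpt_dir)

for step, (inputs, targets) in enumerate(loader):
    inputs = inputs.to(model_engine.device)
    targets = targets.to(model_engine.device)
    loss = nn.functional.mse_loss(model_engine(inputs), targets)
    model_engine.backward(loss)
    model_engine.step()
    if step > 0 and step % 500 == 0:
        # every rank calls save_checkpoint; DeepSpeed coordinates the writes
        model_engine.save_checkpoint(ckpt_dir, tag=f"step{step}")

With the shared-file-system approach, ckpt_dir simply points at the mounted Filestore/EFS/NFS path instead of a local directory, and the same calls apply.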