After running my MXNet application with this command:

MXNET_CUDNN_AUTOTUNE_DEFAULT=0 python bird.py train 0.8 &> log

I found that training ran at only 300 samples per second, and GPU utilization looked very erratic, swinging between 98% and 0%:

# nvidia-smi -l |grep Default
| N/A   48C    P0   184W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   44C    P0   145W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   47C    P0    44W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   43C    P0    40W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   45C    P0   182W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0   182W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   44C    P0    42W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0    41W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |     21%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     17%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |     48%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     44%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |     41%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     36%      Default |
| N/A   45C    P0    43W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     98%      Default |
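To put a number on how erratic this is, the rows can be summarized with a short script. This is only a rough sketch: the function name `summarize`, the regular expression, and the 50% "idle" threshold are my own assumptions for illustration, and it expects rows in exactly the format shown above.

```python
import re

# Captures the power draw (W) and the GPU utilization (%) from one
# `nvidia-smi -l | grep Default` row in the format shown above.
ROW = re.compile(r"(\d+)W\s*/\s*\d+W\s*\|.*?\|\s*(\d+)%")


def summarize(rows, idle_below=50):
    """Return simple statistics for a list of nvidia-smi rows.

    idle_below is an arbitrary threshold (an assumption, not an NVIDIA
    definition): samples under it are counted as "mostly idle".
    """
    samples = []
    for row in rows:
        match = ROW.search(row)
        if match:  # skip lines that do not look like a GPU row
            samples.append((int(match.group(1)), int(match.group(2))))
    if not samples:
        return None
    power = [p for p, _ in samples]
    util = [u for _, u in samples]
    return {
        "rows": len(samples),
        "mean_util": sum(util) / len(util),
        "idle_fraction": sum(u < idle_below for u in util) / len(util),
        "mean_power_w": sum(power) / len(power),
    }


if __name__ == "__main__":
    import sys

    # Example: nvidia-smi -l | grep Default | head -n 100 | python gpu_stats.py
    print(summarize(sys.stdin.readlines()))
```

A high idle fraction combined with low average power draw suggests the GPU is waiting for input rather than computing.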

About two days later, I finally noticed these messages in the MXNet log:

INFO:root:Using 1 threads for decoding...
INFO:root:Set enviroment variable MXNET_CPU_WORKER_NTHREADS to a larger number to use more threads.

After changing my command to:

MXNET_CPU_WORKER_NTHREADS=16 MXNET_CUDNN_AUTOTUNE_DEFAULT=0 python bird.py train 0.8 &> log

training speed rose to 690 samples per second (about 2.3 times faster), and GPU utilization became much smoother, since more CPU threads were now decoding images:

# nvidia-smi -l |grep Default
| N/A   40C    P0   173W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   45C    P0   182W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   42C    P0   183W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   46C    P0   163W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   43C    P0   153W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   48C    P0   181W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   44C    P0   168W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   49C    P0   190W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   45C    P0   181W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   50C    P0   136W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   46C    P0   138W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   51C    P0   186W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   47C    P0   161W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   52C    P0   212W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   48C    P0   192W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   52C    P0   155W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   48C    P0   152W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   54C    P0   180W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   49C    P0   166W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   54C    P0   194W / 250W |   4971MiB / 16276MiB |     98%      Default |
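To avoid forgetting the variable on the command line, it could also be set at the top of the training script. The sketch below is an untested idea rather than a documented MXNet recipe: the helper name `configure_decode_threads` is hypothetical, and it assumes MXNet reads `MXNET_CPU_WORKER_NTHREADS` when the library is loaded, so the variable has to be set before `import mxnet`.

```python
import os


def configure_decode_threads(requested=None, env=os.environ):
    """Set MXNET_CPU_WORKER_NTHREADS unless it is already set.

    A value exported in the shell always wins, so the command-line form
    shown above keeps working. Returns the thread count in effect.
    """
    key = "MXNET_CPU_WORKER_NTHREADS"
    if key in env:
        return int(env[key])
    available = os.cpu_count() or 1  # cpu_count() may return None
    if requested is None:
        threads = available
    else:
        # Clamp the request to the range [1, number of CPUs].
        threads = max(1, min(requested, available))
    env[key] = str(threads)
    return threads


# Assumption: this must run before MXNet is imported.
configure_decode_threads(16)
# import mxnet as mx
```

Clamping to the CPU count is a conservative choice; the best value depends on how expensive the decoding is, so it is worth checking the samples-per-second figure after each change.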