After running my MXNet application with this command:

```
MXNET_CUDNN_AUTOTUNE_DEFAULT=0 python bird.py train 0.8 &> log
```

I found that training ran at only 300 samples per second, and the GPU utilization looked very strange:

    # nvidia-smi -l | grep Default
    | N/A 48C P0 184W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 44C P0 145W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 47C P0 44W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 43C P0 40W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 45C P0 182W / 250W | 4971MiB / 16276MiB | 0% Default |
    | N/A 42C P0 182W / 250W | 4971MiB / 16276MiB | 0% Default |
    | N/A 44C P0 42W / 250W | 4971MiB / 16276MiB | 0% Default |
    | N/A 42C P0 41W / 250W | 4971MiB / 16276MiB | 0% Default |
    | N/A 45C P0 42W / 250W | 4971MiB / 16276MiB | 0% Default |
    | N/A 42C P0 38W / 250W | 4971MiB / 16276MiB | 0% Default |
    | N/A 45C P0 42W / 250W | 4971MiB / 16276MiB | 21% Default |
    | N/A 42C P0 38W / 250W | 4971MiB / 16276MiB | 17% Default |
    | N/A 45C P0 42W / 250W | 4971MiB / 16276MiB | 48% Default |
    | N/A 42C P0 38W / 250W | 4971MiB / 16276MiB | 44% Default |
    | N/A 45C P0 42W / 250W | 4971MiB / 16276MiB | 0% Default |
    | N/A 42C P0 38W / 250W | 4971MiB / 16276MiB | 0% Default |
    | N/A 45C P0 42W / 250W | 4971MiB / 16276MiB | 41% Default |
    | N/A 42C P0 38W / 250W | 4971MiB / 16276MiB | 36% Default |
    | N/A 45C P0 43W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 42C P0 38W / 250W | 4971MiB / 16276MiB | 98% Default |
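To put a number on how uneven the utilization is, the filtered `nvidia-smi` lines can be summarized with a few lines of Python. This is only a rough sketch: `summarize_utilization` and `UTIL_RE` are names I made up for illustration, and the regular expression assumes the utilization column always looks like `98% Default`, as it does in the output above.

```python
import re
from statistics import mean

# Matches the GPU-Util column of a filtered nvidia-smi line, e.g. "98% Default".
# Assumption: compute mode is "Default", as in the output shown above.
UTIL_RE = re.compile(r"(\d+)%\s+Default")


def summarize_utilization(text):
    """Return (mean, min, max, idle_fraction) of the GPU-Util samples in text."""
    utils = [int(value) for value in UTIL_RE.findall(text)]
    if not utils:
        raise ValueError("no utilization samples found")
    # Fraction of samples in which the GPU reported 0% utilization.
    idle_fraction = sum(1 for u in utils if u == 0) / len(utils)
    return mean(utils), min(utils), max(utils), idle_fraction


# Three rows copied from the output above, one per line.
sample = """
| N/A 48C P0 184W / 250W | 4971MiB / 16276MiB | 98% Default |
| N/A 45C P0 182W / 250W | 4971MiB / 16276MiB | 0% Default |
| N/A 45C P0 42W / 250W | 4971MiB / 16276MiB | 21% Default |
"""
print(summarize_utilization(sample))
```

A large idle fraction together with a high maximum suggests the GPU is repeatedly waiting for input rather than being limited by compute.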

About two days later, I finally noticed some messages that MXNet had logged:

    INFO:root:Using 1 threads for decoding...
    INFO:root:Set enviroment variable MXNET_CPU_WORKER_NTHREADS to a larger number to use more threads.

After changing my command to:

```
MXNET_CPU_WORKER_NTHREADS=16 MXNET_CUDNN_AUTOTUNE_DEFAULT=0 python bird.py train 0.8 &> log
```

the training speed rose to 690 samples per second, and the GPU utilization became much smoother, since MXNet could now use more CPU threads to decode images:

    # nvidia-smi -l | grep Default
    | N/A 40C P0 173W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 45C P0 182W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 42C P0 183W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 46C P0 163W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 43C P0 153W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 48C P0 181W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 44C P0 168W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 49C P0 190W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 45C P0 181W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 50C P0 136W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 46C P0 138W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 51C P0 186W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 47C P0 161W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 52C P0 212W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 48C P0 192W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 52C P0 155W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 48C P0 152W / 250W | 4971MiB / 16276MiB | 98% Default |
    | N/A 54C P0 180W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 49C P0 166W / 250W | 4971MiB / 16276MiB | 96% Default |
    | N/A 54C P0 194W / 250W | 4971MiB / 16276MiB | 98% Default |
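To avoid forgetting the variable on the next run, it could also be given a default at the top of the training script. The following is a minimal sketch, not something taken from MXNet itself: `decode_thread_env` is a hypothetical helper, the cap of 16 simply mirrors the value I used above, and I am assuming it is safest to set the variable before `mxnet` is imported.

```python
import os


def decode_thread_env(max_threads=16):
    """Suggest a MXNET_CPU_WORKER_NTHREADS value: the CPU count, capped at max_threads."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return {"MXNET_CPU_WORKER_NTHREADS": str(min(cpus, max_threads))}


# setdefault keeps any value already passed on the command line.
for name, value in decode_thread_env().items():
    os.environ.setdefault(name, value)

# Assumption: the import comes after the variable is set.
# import mxnet as mx
print(os.environ["MXNET_CPU_WORKER_NTHREADS"])
```

The best value depends on the machine and on how expensive the decoding is, so it is worth trying a few settings and watching both the samples-per-second figure and the `nvidia-smi` output.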