After installing an RTX 2080 Ti in an old 2016 desktop at the beginning of 2019, we used it to train YOLOv6 for a while. Recently, however, the training job has occasionally hung and the GPU has stopped working. The only message I can see is from dmesg:
[ 8104.078794] NVRM: GPU at PCI:0000:01:00: GPU-b4f425ef-2d0f-f29e-5624-ff96b37c2c46
[ 8104.078796] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 8104.078797] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 8104.078803] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
At first, I suspected the NVIDIA driver was too new. But after rolling back to an older driver, the same errors showed up in dmesg. The problem also seemed to occur more frequently: sometimes the GPU could not keep running for more than 24 hours.
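To keep track of how often the error recurs, the kernel log can be scanned for Xid events. The following is only a minimal sketch, assuming the log lines look like the dmesg excerpt above; the names `XidEvent`, `parse_xid_events`, and `scan_dmesg` are my own for illustration and are not part of any NVIDIA tool.

```python
import re
import subprocess
from typing import List, NamedTuple


class XidEvent(NamedTuple):
    """One NVRM Xid report found in the kernel log (hypothetical helper type)."""
    uptime_s: float  # dmesg timestamp: seconds since boot
    pci_addr: str    # PCI address as printed by the driver, e.g. "0000:01:00"
    xid: int         # Xid code, e.g. 79 in the log above


# Matches lines such as:
# [ 8104.078796] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', ...
XID_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)"
)


def parse_xid_events(log_text: str) -> List[XidEvent]:
    """Return every Xid event found in the given kernel log text."""
    return [
        XidEvent(float(m.group(1)), m.group(2), int(m.group(3)))
        for m in XID_RE.finditer(log_text)
    ]


def scan_dmesg() -> List[XidEvent]:
    """Run dmesg and parse its output (may need root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    return parse_xid_events(result.stdout)
```

Calling `scan_dmesg()` after a hang and comparing the `uptime_s` values across boots gives a rough idea of how long the GPU survives before it drops off the bus.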
Since Ubuntu 18.04 (and its Linux kernel) is quite old, I decided to upgrade it. Although I have installed many Linux systems and kernels on different machines (servers, desktops, laptops, and even development boards), this was the first time I had upgraded an existing Ubuntu system.
By following the guide, I just managed to upgrade from 18.04 to 20.04. Surprisingly, the new system works well with the older NVIDIA driver, and the GPU has now been running smoothly for more than 12 hours.
In conclusion, we should run a newer system (with a newer kernel) when using drivers for newer hardware. If the training job doesn't report any errors, I will keep using 20.04 and save myself the time of upgrading to 22.04.