RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/...

The full error:

RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1579022027550/work/aten/src/THC/THCGeneral.cpp:50
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022027550/work/aten/src/THC/THCGeneral.cpp line=50 error=3 : initialization error

I searched for it on Google and found very few similar errors. One result matched mine exactly, and its answer was that CUDA could not use the GPU. I dismissed that at the time: my CUDA versions all matched, so how could that be the problem? The other search results did not match my error.

I also wondered whether using four cards at once was too many, and whether the server even had that many cards.
So I switched to a single card, but the error was still the same.


Finally, somewhat sheepishly, I tried torch.cuda.is_available() and found that I could not use the GPU at all.
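In hindsight this is the check I should have run first. A minimal sketch of it as a small helper, assuming PyTorch is installed; the function name cuda_summary is made up for illustration. Note that the real call is spelled torch.cuda.is_available().

```python
def cuda_summary():
    """Return a short description of what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # is_available() typically returns False instead of raising when the
    # driver cannot initialize, for example after a GPU has been lost.
    if not torch.cuda.is_available():
        return "CUDA is not available"
    return f"CUDA is available, {torch.cuda.device_count()} device(s) visible"


print(cuda_summary())
```

If this reports that CUDA is not available on a machine that normally has working GPUs, the problem is probably below PyTorch, and nvidia-smi is the next thing to look at.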

Running nvidia-smi directly shows:

Unable to determine the device handle for GPU 0000:08:00.0 GPU is lost. Reboot the system to recover this GPU
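The same check can be scripted so that it runs before a job starts. This is only a sketch: it assumes the message always contains the text "GPU is lost", as it did in my case, and the helper names are made up for illustration.

```python
import shutil
import subprocess

LOST_MARKER = "GPU is lost"  # assumption: taken from the message above


def output_reports_lost_gpu(text):
    """True if nvidia-smi output contains the 'GPU is lost' message."""
    return LOST_MARKER in text


def check_nvidia_smi():
    """Run nvidia-smi. Return None if it is missing or does not finish
    in time, otherwise True/False for whether a lost GPU was reported."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=60
        )
    except subprocess.TimeoutExpired:
        return None
    # The message may go to stdout or stderr, so search both.
    return output_reports_lost_gpu(result.stdout + result.stderr)


status = check_nvidia_smi()
if status is None:
    print("nvidia-smi is missing or did not respond")
elif status:
    print("A GPU is lost; the server needs a reboot")
else:
    print("nvidia-smi did not report a lost GPU")
```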

In the end I asked a senior colleague and learned that card No. 3 on this server is faulty and cannot be used. Once you call it, you basically have to restart the server.

So the meaning of this error turns out to be very simple: card No. 3 is faulty and cannot be initialized. Although I later tried a single card, card No. 3 had already crashed the server's GPUs, so no GPU could be used until the server was restarted, and the same error was reported.
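After the reboot, one way to avoid touching the bad card again is to hide it from CUDA with the CUDA_VISIBLE_DEVICES environment variable. The sketch below assumes the server has 4 cards and that "card No. 3" means index 3; both numbers are placeholders to check against your own machine, and visible_devices is a made-up helper. Setting CUDA_DEVICE_ORDER to PCI_BUS_ID makes CUDA number the cards the same way nvidia-smi does.

```python
import os


def visible_devices(total, faulty):
    """Comma-separated GPU indices 0..total-1 with the faulty ones removed."""
    return ",".join(str(i) for i in range(total) if i not in faulty)


# Assumptions: 4 cards in total, faulty card at index 3 (placeholders).
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # match nvidia-smi numbering
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices(4, {3})

# Import torch only after this point so the setting takes effect.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints 0,1,2
```

These variables are read when CUDA initializes, so they should be set at the very top of the script, or in the shell before launching it.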

