I ran into the following error when running PyTorch code:
/opt/conda/conda-bld/pytorch_1634272128894/work/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: nll_loss2d_forward_kernel: block: [0,0,0], thread: [225,0,0] Assertion `t >= 0 && t < n_classes` failed.
. . .
loss.backward()
. . .
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Tracing the failure showed that it occurred at loss.backward(). I then searched for the error message:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
I found a wide range of suggested solutions on the Internet. In summary, the cause may be one of the following:
Some data is still loaded on card 0 by default. In this case, I observed that card 0 was fully occupied, so my guess was that part of the data was still landing there, even though a device had been specified for both the model and the data and an idle card had been selected. In that situation, use CUDA_VISIBLE_DEVICES to specify the card number in advance when launching the program, so that everything that uses the GPU is loaded onto the specified card (note that the card number in the code then needs to be changed to 0, because only one card is now visible). Testing again, CUDA_VISIBLE_DEVICES=1 python test.py runs successfully.
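The same restriction can be applied from a Python launcher instead of the shell. This is a minimal sketch, assuming the script is started as a child process; test.py is the script name used in the command above, while `env_for_gpu` and `run_on_gpu` are hypothetical helper names introduced only for illustration.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment in which only one physical GPU is visible."""
    env = dict(os.environ if base_env is None else base_env)
    # Visible devices are renumbered from 0, so inside the child process
    # the selected card is addressed as "cuda:0".
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(script, gpu_index):
    """Launch `script` with only `gpu_index` visible and return its exit code."""
    result = subprocess.run([sys.executable, script], env=env_for_gpu(gpu_index))
    return result.returncode


# Equivalent to the shell command: CUDA_VISIBLE_DEVICES=1 python test.py
# exit_code = run_on_gpu("test.py", 1)
```

Setting the variable before the process starts matters: it should be in place before CUDA is initialized, which is why the sketch passes it through the child's environment rather than changing it midway through a running program.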
This error can have many different causes. Most people on the Internet blame version incompatibility, so many people (myself included) keep upgrading and downgrading PyTorch, CUDA, and cuDNN in an attempt to fix it. The result is often wasted effort that ends right back where it started.
In fact, if you encounter this kind of problem, don't rush to upgrade or downgrade anything. First check whether the problem is really caused by the GPU.
To determine whether the problem lies with the GPU, first run the program on the CPU. Doing so produced the following error:
Traceback (most recent call last):
. . .
  File "/home/.conda/envs/pt2/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/.conda/envs/pt2/lib/python3.8/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
IndexError: Target 3 is out of bounds.
Once the error was traced to the loss computation, I realized that I had forgotten to set the number of classes output by the model to match the number of classes in the dataset.
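The mismatch can be reproduced on the CPU in a few lines. This is a minimal sketch, assuming PyTorch is installed; the tensor shapes, the three-class setup, and the helper name `loss_on_cpu` are illustrative and not taken from the original code.

```python
import torch
import torch.nn.functional as F


def loss_on_cpu(logits, targets):
    """Compute cross-entropy on the CPU and return a readable error, if any."""
    try:
        loss = F.cross_entropy(logits.cpu(), targets.cpu())
        return "ok", float(loss)
    except (IndexError, RuntimeError) as exc:
        # On the CPU an invalid label fails at the loss call itself,
        # rather than surfacing later as a device-side assertion.
        return "error", str(exc)


# Illustrative setup: the model outputs 3 classes, but one label is 3.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 1, 2, 3])
status, detail = loss_on_cpu(logits, targets)
print(status, detail)
```

Here the labels run from 0 to 3, so the model would need four outputs; with only three, the label 3 violates the same `t >= 0 && t < n_classes` condition that the CUDA kernel asserted.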
The fix is to set the number of classes output by the model to the number of classes in the dataset.
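One way to catch this kind of mismatch before the loss is ever computed is to check the labels against the model's output size. The following is a rough sketch in plain Python; `out_of_range_labels` and `check_label_range` are hypothetical helper names, and the check assumes every label is meant to be a valid class index (labels reserved for an ignore index would need to be filtered out first).

```python
def out_of_range_labels(labels, n_classes):
    """Return the sorted distinct labels that violate 0 <= t < n_classes."""
    return sorted({int(t) for t in labels if not 0 <= int(t) < n_classes})


def check_label_range(labels, n_classes):
    """Raise ValueError early, before the loss runs, if any label is invalid."""
    bad = out_of_range_labels(labels, n_classes)
    if bad:
        raise ValueError(f"labels {bad} are outside the valid range [0, {n_classes})")


# Plain lists work; for a tensor, pass e.g. target.flatten().tolist().
labels = [0, 1, 2, 3]
print(out_of_range_labels(labels, n_classes=3))  # [3]
check_label_range(labels, n_classes=4)           # passes: 4 outputs cover labels 0..3
```

Running such a check once on the dataset's labels is cheap, and it reports the offending values directly instead of leaving them to surface as a GPU assertion.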
When you encounter a GPU-related error, don't rush to check the GPU-related settings. First run the model on the CPU to see whether there is a bug in the model itself.