I encountered the following errors when running PyTorch code:
/opt/conda/conda-bld/pytorch_1634272128894/work/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: nll_loss2d_forward_kernel: block: [0,0,0], thread: [225,0,0] Assertion `t >= 0 && t < n_classes` failed.
...
    loss.backward()
...
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
The error was traced to the call to loss.backward(), and the reported message was:

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Searching online turns up a wide range of suggested fixes. The most common explanation is a version incompatibility, so many people (myself included) keep upgrading and downgrading PyTorch, CUDA, and cuDNN in the hope of solving the problem, which usually leads nowhere.
In fact, if you run into this kind of error, don't rush to change versions. First check whether the problem only shows up when running on the GPU, for example with the small sketch below.
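As a minimal sketch (the model and batch here are toy stand-ins; in practice you would move your own model and tensors with .to(torch.device("cpu"))), the idea is simply to force everything onto the CPU before the forward pass and backward pass:

import torch
import torch.nn as nn

device = torch.device("cpu")                 # force the CPU for debugging

# Hypothetical stand-ins for the real model and batch
model = nn.Linear(10, 3).to(device)          # a 3-class classifier
images = torch.randn(4, 10).to(device)
labels = torch.randint(0, 3, (4,)).to(device)

criterion = nn.CrossEntropyLoss()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()                              # on the CPU, any real error is reported clearly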
To determine whether the GPU was hiding the real problem, I first ran the program on the CPU, and the following error was reported:
Traceback (most recent call last):
...
  File "/home/.conda/envs/pt2/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/.conda/envs/pt2/lib/python3.8/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
IndexError: Target 3 is out of bounds.
Once the error was pinned down to the loss-function computation, I realized I had forgotten to make the number of classes the model outputs match the number of classes in the dataset. The mistake can be reproduced in a few lines, as shown below.
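Here is a minimal reproduction (not my actual model, just an illustration): the model produces logits for 3 classes, so valid labels are 0, 1, and 2, but the dataset contains a label 3.

import torch
import torch.nn as nn

logits = torch.randn(1, 3)          # model output for 3 classes (valid labels: 0, 1, 2)
target = torch.tensor([3])          # label 3 comes from a dataset with 4 classes
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)    # on the CPU: IndexError: Target 3 is out of bounds.

On the GPU, the same mistake surfaces only as the device-side assertion `t >= 0 && t < n_classes` and the unhelpful cuDNN error shown above.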
The solution: set the number of classes the model outputs to the number of classes in the dataset.
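A minimal sketch of the fix, assuming a toy fully connected classifier (for many torchvision datasets you can get the class count via len(train_dataset.classes)); the key point is that the final layer's out_features must equal the number of distinct labels in the data:

import torch.nn as nn

num_classes = 4                      # assumption: the dataset has 4 classes, labelled 0..3
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, num_classes),     # the last layer must output num_classes logits
)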
Takeaway: when you hit a cryptic GPU-related error, don't rush to fiddle with GPU settings or library versions. Run the model on the CPU first to see whether the bug is in the model itself, because the CPU error messages are usually much clearer.