RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

created at: 12-11-2021

Problem Description

Running PyTorch code produced the following error:

/opt/conda/conda-bld/pytorch_1634272128894/work/aten/src/ATen/native/cuda/NLLLoss2d.cu:95: nll_loss2d_forward_kernel: block: [0,0,0], thread: [225,0,0] Assertion `t >= 0 && t < n_classes` failed.
.
.
loss.backward()
.
.
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

The error was traced to the call to loss.backward(). Searching online for the error message:

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Searching online turns up a wide range of suggested fixes, which boil down to the following possible causes:

  1. Version compatibility issues (graphics driver, CUDA, and cuDNN versions)
  2. GPU memory overflow (reduce the batch size or num_workers, restrict the job to one GPU, disable cuDNN)
  3. Stale caches (clear the PyCharm and NVIDIA caches)
  4. The number of classes is set incorrectly
  5. Some data was still being placed on GPU 0 by default. In one report, GPU 0 was fully occupied even though a device had been specified for both the model and the data and an idle GPU was being used, so the guess was that some tensors were still landing on GPU 0 by default. The fix was to pre-select the GPU with CUDA_VISIBLE_DEVICES so that everything that uses the GPU ends up on the specified card (note that the device index in the code must then be changed to 0, since only one card is visible). After that, CUDA_VISIBLE_DEVICES=1 python test.py ran successfully.
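Cause 4 corresponds directly to the CUDA-side assertion `t >= 0 && t < n_classes` in the first error above. A quick sanity check of the labels before training can surface it early; the helper below is a sketch (the function name `check_targets` is hypothetical, not from the original post):

```python
import torch

def check_targets(targets: torch.Tensor, n_classes: int) -> None:
    """Raise a readable error if any label would trip the CUDA assertion
    `t >= 0 && t < n_classes` inside nll_loss2d_forward_kernel."""
    bad = (targets < 0) | (targets >= n_classes)
    if bad.any():
        raise ValueError(
            f"Found labels outside [0, {n_classes}): "
            f"{targets[bad].unique().tolist()}"
        )

# Labels 0..3 are valid only if the model outputs at least 4 classes
targets = torch.tensor([[0, 1], [2, 3]])
check_targets(targets, n_classes=4)  # passes silently
```

Running the same check with `n_classes=3` raises a `ValueError` on the CPU instead of an opaque cuDNN error mid-backward on the GPU.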

Cause Analysis

The causes of this error are varied. Most posts online blame version incompatibility, so many people (myself included) repeatedly upgrade and downgrade PyTorch, CUDA, and cuDNN in the hope of fixing it, usually to no avail.

In fact, if you hit this error, don't rush to change versions. First check whether the problem only appears on the GPU.

To determine whether it is a GPU-side issue, run the program on the CPU first. Doing so produced the following error:

Traceback (most recent call last):
.
.
.
  File "/home/.conda/envs/pt2/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/.conda/envs/pt2/lib/python3.8/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
IndexError: Target 3 is out of bounds.

Once the error was pinpointed to the loss computation, I realized I had forgotten to set the number of classes output by the model to match the number of classes in the dataset.

Solution

Set the number of classes output by the model to match the number of classes in the dataset.
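As a concrete sketch of that fix (the toy one-layer segmentation head and the name `NUM_CLASSES` are illustrative assumptions, not code from the original post), the model's output channels must equal the dataset's class count:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 4  # must equal the number of classes in the dataset

# Toy segmentation head: output channels == NUM_CLASSES
model = nn.Conv2d(in_channels=3, out_channels=NUM_CLASSES, kernel_size=1)

x = torch.randn(2, 3, 8, 8)                         # (N, C, H, W) input
target = torch.randint(0, NUM_CLASSES, (2, 8, 8))   # labels in [0, NUM_CLASSES)

logits = model(x)                       # (N, NUM_CLASSES, H, W)
loss = F.cross_entropy(logits, target)  # targets in range, no IndexError
loss.backward()
```

If `NUM_CLASSES` were set to 3 while the labels still ranged over 0..3, this same code would raise `IndexError: Target 3 is out of bounds.` on the CPU, exactly as in the traceback above.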

Takeaway

When you run into a GPU-related problem, don't rush to check GPU-related settings. First run the model on the CPU to see whether the bug is in the model itself.
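That debugging step, temporarily forcing everything onto the CPU so PyTorch raises a readable Python exception instead of an opaque cuDNN error, can be a one-line toggle. This is a sketch under the assumption of a simple training script; the flag name `DEBUG_ON_CPU` is hypothetical:

```python
import torch
import torch.nn.functional as F

# CPU ops raise ordinary Python exceptions (e.g. IndexError) that point at
# the real bug, while CUDA kernels fail asynchronously with vague messages.
DEBUG_ON_CPU = True
device = torch.device(
    "cpu" if DEBUG_ON_CPU or not torch.cuda.is_available() else "cuda"
)

model = torch.nn.Linear(10, 3).to(device)
x = torch.randn(4, 10, device=device)
target = torch.tensor([0, 1, 2, 1], device=device)  # labels in [0, 3)

loss = F.cross_entropy(model(x), target)
loss.backward()
```

Once the script runs cleanly on the CPU, flip the flag back and the GPU run should succeed as well.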

edited at: 06-21-2022