RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

created at: 06-12-2022

## Problem

Today I swapped in my own dataset and reran the ConvE model, and it threw a puzzling error:

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Between the two runs I did not touch my GPU environment at all, and the kernel is the same one I used before:

Kernel version: 5.13.0-35-generic

uname -a

Linux zax-Lenovo-Legion-R7000P2020H 5.13.0-35-generic ...

In addition, the same environment runs OpenKE without any problem (that toolkit does not include ConvE).

I searched for this error online. A few posts blamed a wrong or unspecified GPU device number, but most said the CUDA and cuDNN versions did not match and recommended installing the matching cuDNN release. I had already checked the match when I first installed them (NVIDIA's cuDNN download page lists which CUDA versions each release supports), yet uninstalling and reinstalling cuDNN is what finally worked. Today's troubleshooting process is written up below in case it is useful to you.

## Solution

### 1. Check whether a graphics card is available

Start `python` in the environment and run:

import torch
print(torch.cuda.device_count())  # number of GPUs visible to PyTorch

If no graphics card is available, or the GPU environment is misconfigured, this prints `0`, and running the model reports the following error:

cuda runtime error (38) : no CUDA-capable device is detected

In my case, an inexplicable reboot fixed this error (you can also try switching the kernel version from the boot menu's advanced options). The causes of a `0` here vary from machine to machine, so search for your specific symptoms; there is little general advice to give.


If the output is greater than or equal to `2`, the error mentioned in the preface may be caused by not specifying the graphics card number. You can add these lines before the model code:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'


Since my output was `1`, the error did not come from this step.
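Either way, it is worth guarding the device selection so the script degrades to CPU instead of crashing. A minimal sketch of my own (the fallback helper is not part of ConvE; note that the environment variable must be set before CUDA is initialized):

```python
import os

# CUDA_VISIBLE_DEVICES is read when the CUDA context is first created,
# so it must be set before any torch.cuda call.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

def pick_device():
    """Return "cuda" when a usable GPU is visible, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch missing: nothing to probe
    return "cuda" if torch.cuda.is_available() else "cpu"

print(pick_device())  # "cuda" on a working GPU setup, "cpu" otherwise
```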

### 2. Check the CUDA and cuDNN versions

**① Check the CUDA version**

Query the highest CUDA version supported by the current driver:

nvidia-smi


Query the currently installed cuda version:

nvcc -V


You can also see more detailed information, including the driver version the toolkit was bundled with, from the version file (named `version.json` on CUDA 11 and later, `version.txt` on older releases):

cat /usr/local/cuda/version.json
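Besides the system-wide files, PyTorch records which CUDA and cuDNN versions it was built against, which is usually what matters for this error. A small check of my own (guarded so it also runs where torch is not installed):

```python
def report_cuda_stack():
    """Describe the CUDA/cuDNN stack that the installed torch was built with."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    return (f"torch {torch.__version__}, "
            f"built with CUDA {torch.version.cuda}, "
            f"cuDNN build {torch.backends.cudnn.version()}")

print(report_cuda_stack())
```

If this reports a CUDA version different from what `nvcc -V` shows, that mismatch alone can produce the execution-failed error.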


**② Check the cuDNN version**

In my case there is a `cudnn.h` file, but the classic check below prints nothing, so the version cannot be read this way; in theory the v8.3.0 I installed earlier should still be present:

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

(For cuDNN 8 and later the version macros moved to `cudnn_version.h`, so `cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2` is the command that works there.)


### 3. Uninstall and reinstall cuDNN

**① Uninstall cuDNN v8.3.0**

Find the installed cuDNN packages:

sudo dpkg -l | grep cudnn



![find cudnn](https://static-workflow.s3.amazonaws.com/media/content_media/waitforme/2249d27a-9a48-414d-a04a-f70cbb8d9665.png)

sudo dpkg -r libcudnn8-samples
sudo dpkg -r libcudnn8-dev
sudo dpkg -r libcudnn8


**② Install cuDNN v8.3.0**

Installation follows NVIDIA's standard procedure, so I won't repeat it here.

However, at the stage of verifying whether the installation succeeded, a small error appeared:

AttributeError: module 'torch.jit' has no attribute 'unused'


Reason: the `torch` and `torchvision` versions do not match.

Solution:

Install `torch` 1.4.0:

pip install torch==1.4.0


Install `torchvision` 0.5.0:

pip install torchvision==0.5.0
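To double-check that pip actually installed the versions you asked for, the standard-library `importlib.metadata` can report them programmatically; the helper name here is my own:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg):
    """Return the installed version string for pkg, or None if it is absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

# e.g. "1.4.0" and "0.5.0" after the installs above (None if not installed)
print(installed_version("torch"), installed_version("torchvision"))
```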


Verify again

$ python3
>>> import torch, torchvision
>>> print(torch.cuda.is_available())
True
>>> exit()

### 4. Rerun the ConvE model

python wrangle_KG.py BA

CUDA_VISIBLE_DEVICES=0 python main.py --model conve --data BA \
                                       --input-drop 0.2 --hidden-drop 0.3 --feat-drop 0.2 \
                                       --lr 0.003 --preprocess

It ran successfully.
