Today I swapped in my own dataset and reran the ConvE model, only to hit a baffling error:
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Between the two runs I did not touch my GPU environment at all, and the kernel is the same one I used before:
Kernel version: 5.13.0-35-generic
uname -a
Linux zax-Lenovo-Legion-R7000P2020H 5.13.0-35-generic ...
In addition, this same environment runs OpenKE without any problem [though that toolkit does not include ConvE].
Searching for this error online, a few posts said to specify the GPU device number explicitly, but most said the CUDA and cuDNN versions did not match and recommended installing the matching cuDNN release. The versions did match when I originally installed them [you can check the compatibility matrix on the official cuDNN site], but I uninstalled and reinstalled cuDNN anyway, and that finally fixed it. Today's troubleshooting process is written up below in case you need it.
### 1. Check the number of available GPUs
Start `python` in the environment and run:
import torch
print(torch.cuda.device_count())  # number of GPUs available
If no GPU is available or the GPU environment is misconfigured, this prints `0`, and running the model will report:
cuda runtime error (38) : no CUDA-capable device is detected
When I once got `0`, an inexplicable reboot fixed it [you can also try selecting a different kernel version from the advanced boot options]. The cause differs from machine to machine; search for the details yourself, there is little point saying more here.
If the output is `2` or more, the error mentioned in the preface may be caused by not specifying which GPU to use. Add a line before the model code (with `import os` at the top):
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
Since my output is `1`, the problem was not here.
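The checks above can be combined into one minimal sketch (assuming PyTorch is installed): pin the visible GPU before the first `import torch`, then list what CUDA can see.

```python
import os

# CUDA_VISIBLE_DEVICES must be set before CUDA is initialized,
# so set it before the first `import torch`.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

import torch

n = torch.cuda.device_count()
print("visible GPUs:", n)  # 0 means no usable GPU or a broken setup
for i in range(n):
    print(i, torch.cuda.get_device_name(i))
```

If this prints `0`, you are in the error (38) situation above rather than the cuDNN one.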
### 2. Check the CUDA and cuDNN versions
**① Check the CUDA version**
Query the highest CUDA version the current driver supports:
nvidia-smi
Query the currently installed CUDA toolkit version:
nvcc -V
You can also see more detailed information about the installed CUDA components with:
cat /usr/local/cuda/version.json
**② Check the cuDNN version**
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
On my machine `cudnn.h` exists but this prints nothing, so the version is unknown. (On cuDNN 8.x the version macros moved to `cudnn_version.h`, so `cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2` is the check to use.) In theory the v8.3.0 I installed earlier should have shown up here.
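Independently of the system header files, you can also ask PyTorch which CUDA and cuDNN versions it is actually linked against; if these disagree with what `nvcc -V` and the header report, a mismatch is likely. A minimal sketch:

```python
import torch

print("torch:", torch.__version__)
# CUDA version torch was built with, e.g. '11.3' (None for CPU-only builds)
print("CUDA seen by torch:", torch.version.cuda)
# cuDNN version as an integer, e.g. 8300 for v8.3.0 (None if unavailable)
print("cuDNN seen by torch:", torch.backends.cudnn.version())
```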
### 3. Uninstall and reinstall cuDNN
**① Uninstall cuDNN v8.3.0**
Find the installed cuDNN packages:
sudo dpkg -l | grep cudnn

Then remove them (samples and dev first, then the runtime library):
sudo dpkg -r libcudnn8-samples
sudo dpkg -r libcudnn8-dev
sudo dpkg -r libcudnn8
**② Install cuDNN v8.3.0**
The install steps are the standard ones, so I won't repeat them here.
However, when verifying whether the installation succeeded, a small error came up:
AttributeError: module 'torch.jit' has no attribute 'unused'
Reason: the `torch` and `torchvision` versions do not match.
**Solution**
Install the `torch 1.4.0` version:
pip install torch==1.4.0
Install the matching `torchvision` 0.5.0:
pip install torchvision==0.5.0
Verify again
$ python3
>>> import torch
>>> import torchvision
>>> print(torch.cuda.is_available())
True
>>> exit()
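Since CUDNN_STATUS_EXECUTION_FAILED was raised by a convolution, a quicker smoke test than rerunning the full model is to push a single conv through cuDNN. A minimal sketch (falls back to CPU when no GPU is present):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# A single 3x3 convolution: on the GPU this exercises the same cuDNN
# code path that raised CUDNN_STATUS_EXECUTION_FAILED.
conv = nn.Conv2d(3, 8, kernel_size=3).to(device)
x = torch.randn(1, 3, 32, 32, device=device)
y = conv(x)
print(y.shape)  # torch.Size([1, 8, 30, 30])
```

If this runs cleanly on `cuda`, the cuDNN reinstall worked and the full model should too.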
### 4. Rerun the ConvE model
python wrangle_KG.py BA
CUDA_VISIBLE_DEVICES=0 python main.py --model conve --data BA \
--input-drop 0.2 --hidden-drop 0.3 --feat-drop 0.2 \
--lr 0.003 --preprocess
It executed successfully.