It took me a whole day to solve this problem.
The training code ran fine on one machine, but after I moved it to a different machine it raised an error.
At first I thought CUDA 11 was the cause, and I worried that the CUDA version and the PyTorch version did not match. I reinstalled both, but the error persisted.
Symptom:
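Before reinstalling anything, it can help to print what PyTorch itself reports about its CUDA and cuDNN builds. The following is a minimal sketch, not something from my original debugging session; the helper name `describe_environment` is made up for illustration, and it only uses attributes that PyTorch exposes on both CPU and GPU builds.

```python
import torch


def describe_environment() -> dict:
    """Collect the version info relevant to a CUDA/PyTorch mismatch."""
    info = {
        "torch": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None on CPU-only builds).
        "cuda_build": torch.version.cuda,
        # cuDNN version as an integer, or None when cuDNN is not available.
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Name of the first visible GPU.
        info["device"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is True and a small tensor operation on the GPU works, a version mismatch is probably not the cause, which is what I eventually found.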
Traceback (most recent call last):
  File "train.py", line 100, in <module>
    main(opt)
  File "train.py", line 71, in main
  ……
  File "/home/xxxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xaa030590
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
output: TensorDescriptor 0xaa0d6560
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
weight: FilterDescriptor 0xaa0d0360
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 64, 3, 3,
Pointer addresses:
    input: 0x567e50000
    output: 0x568120000
    weight: 0x550a2da00
Save the repro snippet from the error message to a Python file:
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
Then run the file with python. It should report the same error. Next, change the switches (the torch.backends settings at the top of the snippet) one at a time and run it again to see whether the error still occurs.
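The trial-and-error over the switches can be automated. The sketch below is my own illustration, not part of the PyTorch error message: the function names `run_repro` and `try_switches` are hypothetical, and it falls back to the CPU when no GPU is present (in which case cuDNN is not exercised at all). One caveat: some CUDA errors leave the process in a bad state, so if every combination fails after the first error, run each combination in a fresh python process instead.

```python
import itertools

import torch


def run_repro(device: str) -> None:
    """Run the convolution forward/backward pass from the error message."""
    data = torch.randn([1, 64, 80, 144], dtype=torch.float, device=device, requires_grad=True)
    net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1],
                          stride=[1, 1], dilation=[1, 1], groups=1)
    net = net.to(device).float()
    out = net(data)
    out.backward(torch.randn_like(out))
    if device == "cuda":
        # Wait for the GPU so asynchronous errors surface here.
        torch.cuda.synchronize()


def try_switches():
    """Try every on/off combination of the three switches and record the outcome."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    results = []
    for benchmark, deterministic, allow_tf32 in itertools.product([True, False], repeat=3):
        torch.backends.cudnn.benchmark = benchmark
        torch.backends.cudnn.deterministic = deterministic
        torch.backends.cudnn.allow_tf32 = allow_tf32
        torch.backends.cuda.matmul.allow_tf32 = allow_tf32
        try:
            run_repro(device)
            error = None
        except RuntimeError as exc:
            # Keep only the first line of the message, e.g. "cuDNN error: ...".
            error = str(exc).splitlines()[0]
        results.append(((benchmark, deterministic, allow_tf32), error))
    return results


if __name__ == "__main__":
    for flags, error in try_switches():
        print("benchmark=%s deterministic=%s allow_tf32=%s ->" % flags, error or "ok")
```

Note that the loop leaves the switches at the last combination it tried, so set them explicitly afterwards if the same process goes on to do real work.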
For my code, changing the following setting made the error go away:
torch.backends.cudnn.benchmark = False
Then put this line in the training script, before the code that triggers the error.
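As a rough sketch of the placement, assuming a script laid out like my train.py: the `main` function below is a hypothetical stand-in (the real one takes `opt` and builds the full model), and the convolution has the same shape as the one in the error message. The only important part is that the switch is set at the top, before any convolution runs.

```python
import torch

# Turn off the cuDNN autotuner before any model is built or any convolution runs.
torch.backends.cudnn.benchmark = False


def main():
    # Hypothetical stand-in for the real training entry point.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    net = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).to(device)
    data = torch.randn(1, 64, 80, 144, device=device)
    out = net(data)
    return out.shape


if __name__ == "__main__":
    print(main())
```

With benchmark turned off, cuDNN no longer times several convolution algorithms to pick the fastest one, so training may be a little slower on fixed-size inputs.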