InplaceABN Backward Error

created at: 12-11-2021

error

Recently, while hacking on a network that contains the InplaceABN module, I ran into the following error:

```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 256, 7, 7]], which is output 0 of InPlaceABNBackward, is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```

When I used InplaceABN before, I had not studied the paper or the code, so solving this problem took hours of blind trial and error. Although I knew the error was caused by consecutive inplace operations, I could not locate the specific block of code responsible, and I kept applying `clone()` in the wrong places. Only after going through the project's GitHub issues did I really figure out the cause of the problem.

1. Blocks provided by InplaceABN

ABN is standard BN + activation (no memory savings).
InPlaceABN is BN + activation done inplace (with memory savings).
InPlaceABNSync is BN + activation done inplace (with memory savings) + computation of BN (fwd + bwd) with data from all the GPUs.
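
For context, this is how the three blocks can be constructed. A minimal sketch, assuming the mapillary/inplace_abn package (`pip install inplace-abn`); import paths have varied across releases, so adjust to your installed version:

```python
import torch
from inplace_abn import ABN, InPlaceABN, InPlaceABNSync

x = torch.randn(2, 64, 7, 7, device="cuda")

abn = ABN(64).cuda()          # standard BN + activation, no memory savings
iabn = InPlaceABN(64).cuda()  # BN + activation computed inplace (memory savings)
# InPlaceABNSync additionally computes BN across all GPUs (fwd + bwd);
# it needs torch.distributed to be initialized for multi-GPU training.
sync_abn = InPlaceABNSync(64).cuda()

y = iabn(x)  # x's storage is overwritten: InPlaceABN reconstructs what it
             # needs from its output in backward, which is where the savings come from
```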

2. Inplace shortcut

Change `out += residual` to `out = out + residual`.

`+=` and `add_()` are inplace operations.
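
The "version" in the error message refers to autograd's version counter, which every inplace operation bumps; backward fails when a saved tensor's version no longer matches what autograd recorded. A small illustration (`_version` is an internal attribute, used here only for inspection):

```python
import torch

t = torch.ones(3, requires_grad=True)
y = t * 2
print(y._version)  # 0: freshly created
y += 1             # += is inplace
print(y._version)  # 1
y.add_(1)          # add_() is inplace too
print(y._version)  # 2
y = y + 1          # out-of-place: a new tensor, back at version 0
print(y._version)  # 0
```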

The problem I encountered was that my ResidualBlock performed two consecutive inplace operations: InplaceABN followed by `add_`; see the sketch below.
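
Here is a hypothetical sketch of the failing pattern (layer names and shapes are invented to mirror the error above): the output of InPlaceABN is saved for its backward pass, but `out += residual` modifies that same tensor again before backward runs:

```python
import torch
import torch.nn as nn
from inplace_abn import InPlaceABN  # assuming the inplace_abn package

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.abn = InPlaceABN(channels)

    def forward(self, x):
        residual = x
        out = self.abn(self.conv(x))  # first inplace op: InPlaceABN
        out += residual               # second inplace op on the same tensor
        return out

block = ResidualBlock(256).cuda()
x = torch.randn(64, 256, 7, 7, device="cuda", requires_grad=True)
block(x).sum().backward()  # RuntimeError: ... modified by an inplace operation
```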

solution

The fix is to replace the inplace shortcut addition with an out-of-place one, so that the output of InplaceABN keeps the version its backward pass expects:

```python
# In ResidualBlock.forward, instead of:
#     out += residual
# write:
out = out + residual
```
