Generally, when training a neural network, video memory is occupied mainly by the network model and by intermediate variables.
Within the model, the parameters of the convolutional, fully connected, and normalization layers occupy video memory, while activation and pooling layers hold essentially no parameters.
Intermediate variables, such as feature maps, together with optimizer states, consume the most video memory.
PyTorch itself also reserves some video memory, but not much. The methods below roughly follow the recommended priority order.
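To see why parameter layers dominate model memory while layers like ReLU hold none, the parameters can be counted directly. A toy sketch (the layer sizes are arbitrary):

```python
import torch.nn as nn

# Toy model: only Conv2d and Linear contribute parameters; ReLU has none.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),  # 32*3*3*3 weights + 32 biases = 896
    nn.ReLU(),                       # 0 parameters
    nn.Linear(32, 10),               # 32*10 weights + 10 biases = 330
)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 1226 parameters, roughly 4.9 KB in float32 (4 bytes each)
```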
An in-place operation modifies a variable in its existing memory. In PyTorch, this means operating on a tensor's original storage without allocating new memory, thereby reducing memory usage. In-place operations can be performed in three ways:
- Use an activation function with the inplace attribute set to True, such as nn.ReLU(inplace=True)
- Use PyTorch methods that operate in place; their names generally end with an underscore "_", such as tensor.add_(), tensor.scatter_(), F.relu_()
- Use operators that work in place, such as y += x and y *= x
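As a quick illustration (a minimal CPU sketch; the variable names are only illustrative), an in-place operator writes the result into the tensor's existing storage, which can be checked by comparing data pointers:

```python
import torch

x = torch.ones(3)
y = torch.ones(3)

ptr = y.data_ptr()
y += x                   # in-place: result written into y's existing storage
assert y.data_ptr() == ptr

y = y + x                # out-of-place: a fresh tensor is allocated for the result
```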
In the forward method of a custom network, avoid unnecessary intermediate variables and reuse already-allocated memory wherever possible. For example, the following code uses too many intermediate variables and occupies a lot of unnecessary video memory:
```python
def forward(self, x):
    x0 = self.conv0(x)  # input layer
    x1 = F.relu_(self.conv1(x0) + x0)
    x2 = F.relu_(self.conv2(x1) + x1)
    x3 = F.relu_(self.conv3(x2) + x2)
    x4 = F.relu_(self.conv4(x3) + x3)
    x5 = F.relu_(self.conv5(x4) + x4)
    x6 = self.conv(x5)  # output layer
    return x6
```
To reduce memory usage, the forward function above can be rewritten as follows:
```python
def forward(self, x):
    x = self.conv0(x)
    x = F.relu_(self.conv1(x) + x)
    x = F.relu_(self.conv2(x) + x)
    x = F.relu_(self.conv3(x) + x)
    x = F.relu_(self.conv4(x) + x)
    x = F.relu_(self.conv5(x) + x)
    x = self.conv(x)
    return x
```
The two snippets compute the same function, but their video memory footprints differ greatly: the second uses nearly 90% less video memory than the first.
The model's video memory usage comes mainly from the parameters of the convolutional, fully connected, and normalization layers. Optimization methods include, but are not limited to:
- Reduce the number of convolution kernels (i.e. the number of output feature map channels)
- Avoid fully connected layers
- Replace fully connected layers with nn.AdaptiveAvgPool2d()
- Remove normalization layers
- Avoid skip connections that span too many layers (they keep large intermediate feature maps alive)
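As a sketch of the pooling-instead-of-fully-connected point (the layer sizes here are hypothetical): flattening a 64x32x32 feature map into nn.Linear(64*32*32, 10) would need 655,360 weights, while global average pooling first shrinks the linear layer to 64*10 = 640 weights.

```python
import torch
import torch.nn as nn

# Classifier head using global average pooling instead of a large FC layer.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # (N, C, H, W) -> (N, C, 1, 1)
    nn.Flatten(),             # (N, C)
    nn.Linear(64, 10),        # only 64*10 weights + 10 biases
)

x = torch.randn(2, 64, 32, 32)
out = head(x)
print(out.shape)  # torch.Size([2, 10])
```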
When training a convolutional neural network, an epoch is one full pass over the training data, and each epoch is split into batches of batch_size samples. Reducing batch_size is the idiomatic first step when video memory runs out during training. However, batch_size cannot be shrunk indefinitely: too small a batch makes gradient estimates noisy, destabilizing training and potentially preventing convergence.
Splitting the batch is essentially different from reducing batch_size in Technique 4. Splitting a batch means summing the losses of two forward passes and then backpropagating once, whereas reducing batch_size means one forward and one backward pass per smaller batch. Splitting a batch can be understood as three steps, assuming an original batch_size of 64:
- Split the batch into two sub-batches of 32
- Feed each sub-batch (with its target values) through the network, compute the two losses, and add them
- Backpropagate the summed loss
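The three steps above can be sketched as follows (a minimal sketch with a stand-in linear network; the names net, x, y are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.functional import mse_loss

net = nn.Linear(10, 1)  # stand-in network
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)

x, y = torch.randn(64, 10), torch.randn(64, 1)  # original batch of 64
x0, x1 = torch.chunk(x, 2)                      # step 1: split into two sub-batches of 32
y0, y1 = torch.chunk(y, 2)

loss = mse_loss(net(x0), y0) + mse_loss(net(x1), y1)  # step 2: add the two losses
loss.backward()                                       # step 3: backpropagate the sum
optimizer.step()
optimizer.zero_grad()
```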
After each batch is trained, a loss value is obtained. To compute the loss of an epoch, all batch losses must be accumulated; but each batch loss is a tensor on the GPU, so naively accumulating them keeps the epoch loss, and the computation graphs attached to it, in video memory. This can be optimized as follows:
```python
epoch_loss += batch_loss.detach().item()  # accumulate a plain Python float
```
The code above detaches batch_loss from the computation graph and extracts its value as a Python scalar, so accumulating it no longer keeps the GPU tensor or its graph alive.
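A small self-contained sketch of the pattern (the loss here is a toy MSE; names are illustrative):

```python
import torch

pred = torch.randn(8, requires_grad=True)
target = torch.zeros(8)

epoch_loss = 0.0
for _ in range(3):
    batch_loss = ((pred - target) ** 2).mean()  # loss tensor attached to the graph
    epoch_loss += batch_loss.detach().item()    # accumulated as a plain Python float
```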
Reduce training precision. When training neural networks in PyTorch, floating-point data is 32-bit by default. For networks that tolerate lower precision, training can be switched to 16-bit floats, but both the data and the network model must be converted, otherwise an error is raised. The conversion itself is very simple; note, however, that the Adam optimizer may raise errors at half precision, while SGD does not. The steps are as follows:
```python
model = model.cuda().half()              # network model to half precision
x, y = x.cuda().half(), y.cuda().half()  # network input and target to half precision
```
Mixed-precision training uses the GPU to train the network while storing tensors and performing multiplications in half precision to speed up computation, and accumulating in full precision to avoid rounding errors. Mixed-precision training can roughly halve training time and also greatly reduce video memory usage. Before PyTorch 1.6, NVIDIA's apex library was used; since then, PyTorch's built-in amp module can be used. Example code:
```python
import torch
from torch.nn.functional import mse_loss
from torch.cuda.amp import autocast, GradScaler

EPOCH = 10            # number of training iterations
LEARNING_RATE = 1e-3  # learning rate

x, y = torch.randn(3, 100).cuda(), torch.randn(3, 5).cuda()  # network input and target
myNet = torch.nn.Linear(100, 5).cuda()                       # network: one fully connected layer
optimizer = torch.optim.SGD(myNet.parameters(), lr=LEARNING_RATE)
scaler = GradScaler()  # gradient scaler

for i in range(EPOCH):
    optimizer.zero_grad()
    with autocast():                   # run the forward pass in mixed precision
        y_pred = myNet(x)
        loss = mse_loss(y_pred, y)
    scaler.scale(loss).backward()      # scale the loss, then backpropagate
    scaler.step(optimizer)             # unscale the gradients and run the optimizer step
    scaler.update()                    # update the scale factor
```
If the network is very deep (resnet101, for example), training it directly demands a great deal of video memory, and it is often impossible to train the whole network in one pass. In that case the network can be divided into smaller sub-networks. Checkpointing is PyTorch's trade-time-for-space answer to insufficient video memory: instead of keeping every intermediate activation for the backward pass, it recomputes them segment by segment, reducing how many activations must be held in memory at once. Example code:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def conv(inplanes, outplanes, kernel_size, stride, padding):
    return nn.Sequential(
        nn.Conv2d(inplanes, outplanes, kernel_size, stride, padding),
        nn.BatchNorm2d(outplanes),
        nn.ReLU(),
    )

class Net(nn.Module):  # network structure, divided into three sub-networks
    def __init__(self):
        super().__init__()
        self.conv0 = conv(3, 32, 3, 1, 1)
        self.conv1 = conv(32, 32, 3, 1, 1)
        self.conv2 = conv(32, 64, 3, 1, 1)
        self.conv3 = conv(64, 64, 3, 1, 1)
        self.conv4 = nn.Linear(64, 10)  # fully connected layer

    def segment0(self, x):  # sub-network 1
        return self.conv0(x)

    def segment1(self, x):  # sub-network 2
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        return x

    def segment2(self, x):  # sub-network 3
        # pool to (N, 64) so the linear layer receives a 2-D input
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return self.conv4(x)

    def forward(self, x):
        x = checkpoint(self.segment0, x)  # checkpoint each segment
        x = checkpoint(self.segment1, x)
        x = checkpoint(self.segment2, x)
        return x
```
Training with checkpoint requires the network input to have requires_grad=True. In the code above, the network is split into three sub-networks. For networks not built with nn.Sequential(), the approach is the same: define a few more sub-network methods, or build each block separately as in the example. For large network blocks wrapped in nn.Sequential() (unnecessary for small blocks), the checkpoint_sequential helper simplifies the implementation:
```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

class Net(nn.Module):  # network body split into segments for checkpointing
    def __init__(self):
        super().__init__()
        linear = [nn.Linear(10, 10) for _ in range(100)]
        self.conv = nn.Sequential(*linear)  # network body: 100 fully connected layers

    def forward(self, x):
        num_segments = 2  # split into two segments
        x = checkpoint_sequential(self.conv, num_segments, x)
        return x
```
Variables defined in Python are not necessarily freed the moment they are last used. The following code, placed at the beginning of the training loop, forces garbage collection:
```python
import gc

gc.collect()  # reclaim unreferenced memory
```
Because video memory is limited, a larger batch_size often cannot be used when training large network models, even though a larger batch_size generally helps the model converge faster.
Gradient accumulation averages the losses of several batches and only then applies an optimizer update. It is similar in spirit to the batch splitting of Technique 5, except that Technique 5 splits a large batch into smaller ones while still effectively training on the large batch in one backward pass, whereas gradient accumulation trains on small batches and accumulates their gradients across steps.
Gradient accumulation can thus simulate the effect of a larger batch_size. The implementation:
```python
output = myNet(input_)
loss = mse_loss(target, output)
loss = loss / 4                 # average over 4 accumulation steps
loss.backward()                 # gradients accumulate in .grad
if (step + 1) % 4 == 0:
    optimizer.step()            # one update every 4 batches
    optimizer.zero_grad()
```
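A self-contained sketch of the full accumulation loop (the network, data, and ACCUM_STEPS here are illustrative placeholders):

```python
import torch
import torch.nn as nn
from torch.nn.functional import mse_loss

ACCUM_STEPS = 4
net = nn.Linear(10, 1)  # stand-in network
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)

for step in range(8):
    x, y = torch.randn(8, 10), torch.randn(8, 1)  # simulated mini-batch
    loss = mse_loss(net(x), y) / ACCUM_STEPS      # average over the accumulation window
    loss.backward()                               # gradients accumulate in .grad
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                          # one update per ACCUM_STEPS batches
        optimizer.zero_grad()
```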
When running a test program, no gradient-related operations are needed, so gradient computation can be disabled to save video memory, including but not limited to the following operations:
- Call model.eval() to put the model in evaluation mode, disabling behaviors such as batch-normalization statistics updates and dropout.
- Wrap the test code in the context manager with torch.no_grad():, so no computation graph is built.
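Both operations together look like this (a minimal sketch; the network is a toy placeholder):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 10), nn.Dropout(0.5), nn.Linear(10, 1))
net.eval()                    # disable dropout, freeze normalization statistics

x = torch.randn(4, 10)
with torch.no_grad():         # no computation graph is built inside this block
    out = net(x)
```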
Add a gradient-zeroing operation at the beginning of each training or testing iteration:
```python
myNet.zero_grad()      # zero the model parameters' gradients
optimizer.zero_grad()  # zero the optimizer's parameter gradients
```
Similarly, PyTorch's own cache-clearing call can be placed at the beginning of each training iteration to release unused video memory:
```python
torch.cuda.empty_cache()  # release cached video memory
```
In principle, PyTorch frees a tensor automatically once it is no longer referenced, but it keeps that memory in its caching allocator rather than returning it to the driver; empty_cache() returns those cached, unused blocks. This does not reduce the memory available to your own tensors, so the call is mainly useful when other processes need to share the GPU, but it can be worth trying.
Downsampling is similar to pooling in implementation, but is not limited to pooling: a convolution with stride greater than 1 can also replace pooling. Downsampling shrinks the feature map, reducing the number of elements it holds and thereby saving video memory. It can be implemented in the following two ways:
```python
nn.Conv2d(32, 32, 3, 2, 1)  # downsample with a stride-2 convolution
# or: a stride-1 convolution followed by pooling
nn.Conv2d(32, 32, 3, 1, 1)
nn.MaxPool2d(2, 2)
```
The del statement removes a variable binding entirely; to use the name again it must be recreated. Note that del deletes the variable, not the underlying data, which may still be referenced by other variables. Usage is simple:
```python
def forward(self, x):
    input_ = x
    x = F.relu_(self.conv1(x) + input_)
    x = F.relu_(self.conv2(x) + input_)
    x = F.relu_(self.conv3(x) + input_)
    del input_          # input_ is no longer needed
    x = self.conv4(x)   # output layer
    return x
```
The optimizers most commonly used for network training are SGD and Adam. Setting the final training quality aside, SGD has a smaller memory footprint than Adam, which keeps extra state (first- and second-moment estimates) for every parameter. If nothing else works, try switching the optimization algorithm. Both are called the same way:
```python
import torch.optim as optim
from torchvision.models import resnet18

LEARNING_RATE = 1e-3       # learning rate
myNet = resnet18().cuda()  # instantiate the network

optimizer_adam = optim.Adam(myNet.parameters(), lr=LEARNING_RATE)  # Adam optimizer
optimizer_sgd = optim.SGD(myNet.parameters(), lr=LEARNING_RATE)    # SGD optimizer
```
Buy a graphics card with enough video memory.