Solved – Predicting CPU and GPU memory requirements of DNN training

deep learning

Say I have some deep learning model architecture, as well as a chosen mini-batch size. How do I derive from these the expected memory requirements for training that model?

As an example, consider a (non-recurrent) model with input of dimension 1000, 4 fully-connected hidden layers of dimension 100, and an additional output layer of dimension 10. The mini-batch size is 256 examples. How does one determine the approximate memory (RAM) footprint of the training process on the CPU and on the GPU?
If it makes any difference, let's assume the model is trained on a GPU with TensorFlow (thus using cuDNN).

Best Answer

The answer of @ik_vision describes how to estimate the memory space needed for storing the weights, but you also need to store the intermediate activations, and especially for convolutional networks working with 3D data, this is the main part of the memory needed.

To analyze your example:

The input needs 1000 elements
After each of layers 1-4 you have 100 elements, 400 in total
After the final layer you have 10 elements

In total, one sample needs 1410 elements for the forward pass. For every element except the input, you also need gradient information for the backward pass, which is 410 more, totaling 1820 elements per sample. Multiply by the batch size of 256 to get 465,920 elements.

I said "elements" because the size required per element depends on the data type used. With single-precision float32, each element takes 4 bytes, so the total memory needed to store the data blobs will be around 1.8 MB.
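The arithmetic above can be reproduced with a short script. This is only a sketch of the counting rule described in this answer (activations plus one gradient per non-input element). The function name and its parameters are made up for illustration, and the estimate ignores weights, optimizer state, and any workspace the framework allocates.

```python
def activation_memory(input_dim, layer_dims, batch_size, bytes_per_element=4):
    """Count elements (and bytes) held for activations and their gradients."""
    forward = input_dim + sum(layer_dims)  # input plus every layer output
    backward = sum(layer_dims)             # one gradient per non-input element
    elements = (forward + backward) * batch_size
    return elements, elements * bytes_per_element


# The example from the question: 1000 inputs, four hidden layers of 100,
# an output layer of 10, and a mini-batch of 256 float32 samples.
elements, nbytes = activation_memory(1000, [100, 100, 100, 100, 10], 256)
print(elements, "elements,", nbytes / 1e6, "MB")
```

Changing `bytes_per_element` to 2 or 8 gives the corresponding figure for float16 or float64.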

Related Solutions

Solved – L2-norms of gradients increasing during training of deep neural network

I think I understand the problem with the gradient norm value. The negative gradient points toward a local minimum, but it doesn't say how far away that minimum is. That is why the step size is a configurable parameter. When your weights are close to the minimum, a constant step can be larger than necessary, so the update sometimes overshoots in the wrong direction, and in the next epoch the network tries to correct for it. The Momentum algorithm uses a modified approach. After each iteration it increases the weight update if the sign of the gradient stays the same (through an additional term that is added to the $\Delta w$ value). In terms of vectors, this addition can increase the magnitude of the update vector and change its direction as well, so you can miss the ideal step by even more. To correct for this, the network sometimes needs a larger vector, because the minimum is a little further away than in the previous epoch.
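To make the effect of the momentum term concrete, here is a minimal sketch of the classical momentum rule, $\Delta w_t = -\text{step} \cdot g_t + \mu \, \Delta w_{t-1}$. It assumes the textbook formulation, which is not necessarily the exact update the library in the experiment uses. The function name, the step of 0.1, and the momentum of 0.9 are illustrative choices.

```python
import numpy as np


def momentum_updates(gradients, step=0.1, momentum=0.9):
    """Return the weight update produced at each iteration."""
    delta = np.zeros_like(gradients[0], dtype=float)
    deltas = []
    for grad in gradients:
        # Plain gradient step plus a fraction of the previous update.
        delta = -step * grad + momentum * delta
        deltas.append(delta)
    return deltas


# A gradient that keeps the same sign for several iterations.
grads = [np.array([1.0, 0.5])] * 5
norms = [np.linalg.norm(d) for d in momentum_updates(grads)]
print(norms)
```

With a constant gradient, the norm of the update grows at every iteration, while plain gradient descent would keep it fixed at the first value.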

To test that theory I built a small experiment. First, I reproduced the same behaviour with a simpler network architecture and fewer iterations.

import numpy as np
from numpy.linalg import norm
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from neupy import algorithms

plt.style.use('ggplot')

grad_norm = []
def train_epoch_end_signal(network):
    global grad_norm
    # Get gradient for the last layer
    grad_norm.append(norm(network.gradients[-1]))

data, target = make_regression(n_samples=10000, n_features=50, n_targets=1)

target_scaler = preprocessing.MinMaxScaler()
# MinMaxScaler expects a 2D array, so turn the 1D target into a column
target = target_scaler.fit_transform(target.reshape(-1, 1))

mnet = Pipeline([
    ('scaler', preprocessing.MinMaxScaler()),
    ('momentum', algorithms.Momentum(
        (50, 30, 1),
        step=1e-10,
        show_epoch=1,
        shuffle_data=True,
        verbose=False,
        train_epoch_end_signal=train_epoch_end_signal,
    )),
])

mnet.fit(data, target, momentum__epochs=100)

After training I plotted the gradient norms. Below you can see behaviour similar to yours.

plt.figure(figsize=(12, 8))
plt.plot(grad_norm)
plt.title("Momentum algorithm final layer gradient 2-Norm")
plt.ylabel("Gradient 2-Norm")
plt.xlabel("Epoch")
plt.show()

Also, if you look more closely at the training results after each epoch, you will find that the errors vary as well.

plt.figure(figsize=(12, 8))
network = mnet.steps[-1][1]
network.plot_errors()
plt.show()

Next, using almost the same settings, I created another network, but this time I selected a golden section search (algorithms.LinearSearch) to choose the step at each epoch.

grad_norm = []
def train_epoch_end_signal(network):
    global grad_norm
    # Get gradient for the last layer
    grad_norm.append(norm(network.gradients[-1]))
    if network.epoch % 20 == 0:
        print("Epoch #{}: step = {}".format(network.epoch, network.step))

mnet = Pipeline([
    ('scaler', preprocessing.MinMaxScaler()),
    ('momentum', algorithms.Momentum(
        (50, 30, 1),
        step=1e-10,
        show_epoch=1,
        shuffle_data=True,
        verbose=False,
        train_epoch_end_signal=train_epoch_end_signal,
        optimizations=[algorithms.LinearSearch]
    )),
])

mnet.fit(data, target, momentum__epochs=100)

The output below shows how the step varies, printed every 20 epochs.

Epoch #0: step = 0.5278640466583575
Epoch #20: step = 1.103484809236065e-13
Epoch #40: step = 0.01315561773591515
Epoch #60: step = 0.018180616551587894
Epoch #80: step = 0.00547810271094794

If you look at the results after that training, you will find that the variation in the 2-norm is much smaller.

plt.figure(figsize=(12, 8))
plt.plot(grad_norm)
plt.title("Momentum algorithm final layer gradient 2-Norm")
plt.ylabel("Gradient 2-Norm")
plt.xlabel("Epoch")
plt.show()

This optimization also reduces the variation of the errors.

plt.figure(figsize=(12, 8))
network = mnet.steps[-1][1]
network.plot_errors()
plt.show()

As you can see, the main problem with the gradient is the step length.

It's important to note that even with high variation, your network can still improve its prediction accuracy after each iteration.
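For readers unfamiliar with choosing the step by line search, the idea from the second experiment can be sketched as a generic golden section search over the step size. This is a textbook version written for illustration, not the library's implementation. The function names, the search interval [0, 1], and the one-dimensional loss are all assumptions.

```python
import math


def golden_section_search(f, low, high, tol=1e-6):
    """Approximate the minimizer of a unimodal function f on [low, high]."""
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    a, b = low, high
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        # Keep the sub-interval that must contain the minimum.
        if f(c) < f(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


# Toy loss (w - 3)^2 at w = 0, where the gradient is -6.
w, grad = 0.0, -6.0


def loss_along_gradient(step):
    """Loss after moving from w along the negative gradient by `step`."""
    return (w - step * grad - 3.0) ** 2


best_step = golden_section_search(loss_along_gradient, 0.0, 1.0)
print(best_step)
```

For this toy loss the best step is 0.5, which lands exactly on the minimum in a single update. A fixed step would either undershoot or overshoot it.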

Solved – Why is training a deep convolutional neural network taking longer time than anticipated

There are a number of reasons training might be taking a long time, but the first reason that comes to mind is that you haven't set an appropriate learning rate. If your learning rate is too high, instead of descending towards a minimum, your gradient path will bounce around uncontrollably/erratically, and this could happen indefinitely. If your learning rate is too low (I've seen this happen less frequently in practice), your model will take a longer time to reach the minimum, but it will eventually arrive. I'm not familiar with the particular libraries you've referenced, so unfortunately I can't offer library-specific details.
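Both failure modes are easy to see on a toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$, whose gradient is $2w$. The function name and the three learning rates are arbitrary values picked for illustration, not recommendations for a real network.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Minimize f(w) = w**2 starting from w0 and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w  # the gradient of w**2 is 2w
    return w


too_high = gradient_descent(lr=1.1)     # overshoots further on every step
too_low = gradient_descent(lr=0.001)    # barely moves in 50 steps
reasonable = gradient_descent(lr=0.3)   # converges quickly

print(too_high, too_low, reasonable)
```

With the high rate, the iterate flips sign each step and grows in magnitude. With the low rate, it is still close to the starting point after 50 steps. The middle rate reaches the minimum at 0 to within floating-point noise.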

There is a pretty good general answer (with references) on Stack Overflow about setting a good learning rate in neural networks. See the link below.

https://stackoverflow.com/questions/11414374/neural-network-learning-rate-and-batch-weight-update
