Solved – L2 regularization with standard weight initialization

gradient descent, machine learning, neural networks, regularization

Suppose we have a feedforward neural network with L2 regularization, and we train it using SGD with the weights initialized from the standard Gaussian $N(0, 1)$. The weight update scheme can be written as:

$$w \rightarrow \left( 1 - \frac{\eta \lambda}{n} \right)w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$$

where $w$ is any given weight in the network, $\eta$ is the learning rate, $\lambda$ is the regularization rate, $n$ is the total number of training examples, $m$ is the size of the mini-batch, the sum is over all the training examples $x$ in a given mini-batch, and $C_x$ is the cost function for a given training example $x$.
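To make the notation concrete, one such update step looks roughly like this in Python (the function and argument names below are just my own shorthand, not anything standard):

```python
def sgd_l2_step(W, batch, grad_Cx, eta, lam, n):
    """One SGD update with L2 regularization, following
    w -> (1 - eta*lam/n) w - (eta/m) sum_x dC_x/dw.

    W        : weights of one layer (e.g. a NumPy array)
    batch    : the training examples x in the current mini-batch
    grad_Cx  : function (W, x) -> dC_x/dW for a single example
    eta, lam : learning rate and regularization rate
    n        : total number of training examples
    """
    m = len(batch)                                # mini-batch size
    grad_sum = sum(grad_Cx(W, x) for x in batch)  # sum_x dC_x/dw
    return (1 - eta * lam / n) * W - (eta / m) * grad_sum
```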

I want to understand four characteristics of this configuration:

1) Supposing $\lambda$ is not too small, the first epochs of training will be dominated almost entirely by weight decay.

2) Provided $\eta \lambda \ll n$ the weights will decay by a factor of $e^{-{\eta \lambda \over m}}$ per epoch.

3) Supposing $\lambda$ is not too large, the weight decay will tail off when the weights are down to a size around ${1 \over \sqrt{n_w}}$, where $n_w$ is the total number of weights in the network.

4) How does this relate to weight initialization using $N(0, {1 \over n_{in}})$, where $n_{in}$ is the number of inputs to a neuron (no regularization)?

My comments:

1) With the standard initialization of the weights, during the first epochs of learning we will often have ${1 \over m} \sum_x{\partial C_x \over \partial w} \approx 0$, and weight decay will be dominant.

2) We could replace the condition $\eta \lambda \ll n$ with $\lambda$ being a constant and $n \to \infty$. Then we have weight decay at $\lim\limits_{n \to \infty}\left( 1 - {\eta \lambda \over n} \right)^{n \over m}=e^{-{\eta \lambda \over m}}$ per epoch (a quick numerical check is sketched after these comments).

3) I don't quite see how we can tie the tapering of the weight decay to $w \approx {1 \over \sqrt{n_w}}$. It would seem that the weight-decay component vanishes in later epochs irrespective of the actual level of the weights (which doesn't have to be $0$, thanks to the gradient component).

4) Perhaps the relation is that L2 regularization shrinks the weights and in doing so emulates the effect of $N(0, {1 \over n_{in}})$ initialization.
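As a quick sanity check on 2), here is a comparison of the exact per-epoch factor $\left(1 - {\eta \lambda \over n}\right)^{n \over m}$ with $e^{-{\eta \lambda \over m}}$ for some arbitrary values I picked:

```python
import math

eta, lam, m = 0.5, 5.0, 10                   # arbitrary illustrative values
for n in (1_000, 50_000, 1_000_000):         # eta*lam = 2.5 << n in all cases
    exact = (1 - eta * lam / n) ** (n / m)   # decay over the n/m mini-batches of one epoch
    approx = math.exp(-eta * lam / m)        # the claimed per-epoch factor
    print(f"n = {n:>9}: exact = {exact:.6f}, approx = {approx:.6f}")
```

The two numbers are already close for $n = 1000$ and get closer as $n$ grows, which is consistent with the limit above.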

I would appreciate any comments on the correctness of 1) and 2), and any ideas on what is going on in 3) and 4).

Best Answer

Here is what I think:

  1. I agree with your explanation. I would only add the reason why each $\cfrac{\partial C_x}{\partial w}\approx0$, which I believe is the saturation of the neurons maintained during the first epochs, given the "large" standard deviation of the random weight initialization ($\sigma=1$); see the small numerical sketch after these points.

  2. I totally agree with your explanation. Besides, this is also agreed upon in other posts: Data Science 1 and Data Science 2.

  3. and 4) Yes, I think you got it right. I believe so because, if we want the weights of our network to be Normally distributed as $\mathcal{N}\left(0,\sigma \approx \cfrac{1}{\sqrt{n}}\right)$ (following the reasoning of the book at pages 94-96), we should be able to reduce the $\sigma = 1$ of the initial Normal distribution $\mathcal{N}(0, 1)$ (given by the random weights generator). This is something we can do heuristically by re-scaling the weights with a factor of $\frac{1}{\sqrt{n}}$.
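To make the saturation in 1. and the re-scaling in 3. and 4) more tangible, here is a rough sketch for a single sigmoid layer (the input size, the uniform inputs and the layer width are arbitrary choices of mine, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_neurons = 784, 1000        # illustrative sizes (e.g. an MNIST-like input layer)
x = rng.random(n_in)               # one input vector with activations in [0, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for sigma, label in [(1.0, "N(0, 1)"), (1.0 / np.sqrt(n_in), "N(0, 1/n_in)")]:
    W = rng.normal(0.0, sigma, size=(n_neurons, n_in))   # weights of one layer
    z = W @ x                                            # weighted inputs z = W x
    grad = sigmoid(z) * (1.0 - sigmoid(z))               # sigmoid'(z) for each neuron
    print(f"{label:12s}: mean |z| = {np.mean(np.abs(z)):6.2f}, "
          f"median sigmoid'(z) = {np.median(grad):.2e}")
```

With $\sigma = 1$ the weighted inputs are large in magnitude and most neurons sit on the flat parts of the sigmoid, so the factor $\mathrm{sigmoid}'(z)$ is tiny for the typical neuron; with the $\frac{1}{\sqrt{n_{in}}}$ re-scaling the weighted inputs stay of order one and the gradients do not vanish.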

Note: I believe that my last argument is only roughly heuristic as far as the successive activations of the hidden layers following a Normal distribution is concerned. For example, in the third layer of neurons, the activations of the second layer (Normally distributed, following the reasoning of pages 94-96) are multiplied by some Normally initialized weights. We are thereby multiplying two Normal distributions, so a Normal distribution no longer holds in this and successive layers.
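A quick numerical illustration of that last point (my own rough check, not part of the book's argument): the product of two independent standard Normal samples is clearly not Normal; for instance, its excess kurtosis comes out around 6 instead of 0.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1_000_000)   # stand-in for Normally distributed activations
w = rng.normal(size=1_000_000)   # weights drawn from N(0, 1)
p = a * w                        # element-wise products

def excess_kurtosis(x):
    """Excess kurtosis; 0 for a Normal distribution."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

print(excess_kurtosis(a))   # close to 0: Normal
print(excess_kurtosis(p))   # close to 6: the product is far from Normal
```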