Solved – L2 regularization with standard weight initialization

gradient descent, machine learning, neural networks, regularization

Suppose we have a feedforward neural network with L2 regularization, and we train it using SGD with the weights initialized from the standard Gaussian $N(0, 1)$. The weight update scheme can be written as:

$$w \rightarrow \left( 1 - \frac{\eta \lambda}{n} \right)w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$$

where $w$ is any given weight in the network, $\eta$ is the learning rate, $\lambda$ is the regularization rate, $n$ is the total number of training examples, $m$ is the size of the mini-batch, the sum is over all the training examples $x$ in a given mini-batch, and $C_x$ is the cost function for a given training example $x$.
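To make the notation concrete, one such update step looks roughly like this in Python (the function and argument names below are just my own shorthand, not anything standard):

```python
def sgd_l2_step(W, batch, grad_Cx, eta, lam, n):
    """One SGD update with L2 regularization, following
    w -> (1 - eta*lam/n) w - (eta/m) sum_x dC_x/dw.

    W        : weights of one layer (e.g. a NumPy array)
    batch    : the training examples x in the current mini-batch
    grad_Cx  : function (W, x) -> dC_x/dW for a single example
    eta, lam : learning rate and regularization rate
    n        : total number of training examples
    """
    m = len(batch)                                # mini-batch size
    grad_sum = sum(grad_Cx(W, x) for x in batch)  # sum_x dC_x/dw
    return (1 - eta * lam / n) * W - (eta / m) * grad_sum
```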

I want to understand four characteristics of this configuration:

1) Supposing $\lambda$ is not too small, the first epochs of training will be dominated almost entirely by weight decay.

2) Provided $\eta \lambda \ll n$ the weights will decay by a factor of $e^{-{\eta \lambda \over m}}$ per epoch.

3) Supposing $\lambda$ is not too large, the weight decay will tail off when the weights are down to a size around ${1 \over \sqrt{n_w}}$, where $n_w$ is the total number of weights in the network.

4) How does this relate to weight initialization using $N(0, {1 \over n_{in}})$, where $n_{in}$ is the number of inputs to a neuron (no regularization)?

My comments:

1) With the standard initialization of the weights, during the first epochs of learning we will often have ${1 \over m} \sum_x{\partial C_x \over \partial w} \approx 0$, and weight decay will be dominant.

2) We could replace the condition $\eta \lambda \ll n$ with $\lambda$ being a constant and $n \to \infty$. Then we have weight decay at $\lim\limits_{n \to \infty}\left( 1 - {\eta \lambda \over n} \right)^{n \over m}=e^{-{\eta \lambda \over m}}$ per epoch (a quick numerical check is sketched after these comments).

3) I don't quite see how we can tie the tapering of the weight decay to $w \approx {1 \over \sqrt{n_w}}$. It would seem that the weight-decay component vanishes in later epochs irrespective of the actual level of the weights (which doesn't have to be $0$, thanks to the gradient component).

4) Perhaps the relation is that L2 regularization shrinks the weights and in doing so emulates the effect of $N(0, {1 \over n_{in}})$ initialization.
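As a quick sanity check on 2), here is a comparison of the exact per-epoch factor $\left(1 - {\eta \lambda \over n}\right)^{n \over m}$ with $e^{-{\eta \lambda \over m}}$ for some arbitrary values I picked:

```python
import math

eta, lam, m = 0.5, 5.0, 10                   # arbitrary illustrative values
for n in (1_000, 50_000, 1_000_000):         # eta*lam = 2.5 << n in all cases
    exact = (1 - eta * lam / n) ** (n / m)   # decay over the n/m mini-batches of one epoch
    approx = math.exp(-eta * lam / m)        # the claimed per-epoch factor
    print(f"n = {n:>9}: exact = {exact:.6f}, approx = {approx:.6f}")
```

The two numbers are already close for $n = 1000$ and get closer as $n$ grows, which is consistent with the limit above.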

I would appreciate any comments on the correctness of 1) and 2), and any ideas on what is going on in 3) and 4).

Best Answer

Here is what I think:

  1. I agree with your explanation. I would only add the reason why each $\cfrac{\partial C_x}{\partial w}\approx0$, which I believe is the saturation of the neurons maintained during the first epochs, given the "large" standard deviation of the random weight initialization ($\sigma=1$); see the small numerical sketch after these points.

  2. I totally agree with your explanation. Besides, this is also agreed upon in other posts: Data Science 1 and Data Science 2.

  3. and 4) Yes, I think you got it right. I believe so because, if we want the weights of our network to be Normally distributed as $\mathcal{N}\left(0,\sigma \approx \cfrac{1}{\sqrt{n}}\right)$ (following the reasoning of the book at pages 94-96), we should be able to reduce the $\sigma = 1$ of the initial Normal distribution $\mathcal{N}(0, 1)$ (given by the random weights generator). This is something we can do heuristically by re-scaling the weights with a factor of $\frac{1}{\sqrt{n}}$.
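To make the saturation in 1. and the re-scaling in 3. and 4) more tangible, here is a rough sketch for a single sigmoid layer (the input size, the uniform inputs and the layer width are arbitrary choices of mine, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_neurons = 784, 1000        # illustrative sizes (e.g. an MNIST-like input layer)
x = rng.random(n_in)               # one input vector with activations in [0, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for sigma, label in [(1.0, "N(0, 1)"), (1.0 / np.sqrt(n_in), "N(0, 1/n_in)")]:
    W = rng.normal(0.0, sigma, size=(n_neurons, n_in))   # weights of one layer
    z = W @ x                                            # weighted inputs z = W x
    grad = sigmoid(z) * (1.0 - sigmoid(z))               # sigmoid'(z) for each neuron
    print(f"{label:12s}: mean |z| = {np.mean(np.abs(z)):6.2f}, "
          f"median sigmoid'(z) = {np.median(grad):.2e}")
```

With $\sigma = 1$ the weighted inputs are large in magnitude and most neurons sit on the flat parts of the sigmoid, so the factor $\mathrm{sigmoid}'(z)$ is tiny for the typical neuron; with the $\frac{1}{\sqrt{n_{in}}}$ re-scaling the weighted inputs stay of order one and the gradients do not vanish.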

Note: I believe that my last argument is only roughly heuristic as far as the successive activations of the hidden layers following a Normal distribution is concerned. For example, in the third layer of neurons, the activations of the second layer (Normally distributed, following the reasoning of pages 94-96) are multiplied by some Normally initialized weights. We are thereby multiplying two Normal distributions, so a Normal distribution no longer holds in this and successive layers.
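A quick numerical illustration of that last point (my own rough check, not part of the book's argument): the product of two independent standard Normal samples is clearly not Normal; for instance, its excess kurtosis comes out around 6 instead of 0.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1_000_000)   # stand-in for Normally distributed activations
w = rng.normal(size=1_000_000)   # weights drawn from N(0, 1)
p = a * w                        # element-wise products

def excess_kurtosis(x):
    """Excess kurtosis; 0 for a Normal distribution."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

print(excess_kurtosis(a))   # close to 0: Normal
print(excess_kurtosis(p))   # close to 6: the product is far from Normal
```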