There is no such proof; it's an intuitive statement. A model typically predicts training samples better than test samples, because it learns from the training data, whereas the test data is something the model has never seen. It is still possible for the test error to be lower than the training error, especially when the samples are small.
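A quick simulation can make both points concrete. This is only a sketch, assuming a tiny linear-regression setup in Python/NumPy with arbitrary sample sizes and noise level: on average the test MSE exceeds the training MSE, yet a nontrivial fraction of individual runs show the opposite.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_vs_test_mse(n_train=10, n_test=10, n_sims=2000):
    """Fit ordinary least squares on a small training set and compare
    training MSE with test MSE across many simulated data sets."""
    gaps = []
    for _ in range(n_sims):
        x_tr = rng.normal(size=n_train)
        x_te = rng.normal(size=n_test)
        y_tr = 2.0 * x_tr + rng.normal(size=n_train)      # true slope 2, unit noise
        y_te = 2.0 * x_te + rng.normal(size=n_test)
        slope, intercept = np.polyfit(x_tr, y_tr, deg=1)  # least-squares line
        mse_tr = np.mean((y_tr - (slope * x_tr + intercept)) ** 2)
        mse_te = np.mean((y_te - (slope * x_te + intercept)) ** 2)
        gaps.append(mse_te - mse_tr)
    gaps = np.array(gaps)
    print("mean(test MSE - train MSE):", gaps.mean())               # typically positive
    print("share of runs with test MSE < train MSE:", (gaps < 0).mean())

train_vs_test_mse()
```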
Let the data be $\mathbf{x}=(x_1, \ldots, x_n)$. Write $F(\mathbf{x})$ for the empirical distribution. By definition, for any function $f$,
$$\mathbb{E}_{F(\mathbf{x})}[f(X)] = \frac{1}{n}\sum_{i=1}^n f(x_i).$$
Let the model $M$ have density $e^{f(x)}$ where $f$ is defined on the support of the model. The cross-entropy of $F(\mathbf{x})$ and $M$ is defined to be
$$H(F(\mathbf{x}), M) = -\mathbb{E}_{F(\mathbf{x})}[\log(e^{f(X)})] = -\mathbb{E}_{F(\mathbf{x})}[f(X)] = -\frac{1}{n}\sum_{i=1}^n f(x_i).\tag{1}$$
Assuming $\mathbf{x}$ is a simple random sample, its negative log likelihood is
$$-\log(L(\mathbf{x}))=-\log \prod_{i=1}^n e^{f(x_i)} = -\sum_{i=1}^n f(x_i)\tag{2}$$
by virtue of the properties of logarithms (they convert products to sums).
Expression $(2)$ is a constant $n$ times expression $(1)$. Because loss functions are used in statistics only to compare models, it makes no difference that one is a (positive) constant times the other. It is in this sense that the negative log likelihood "is a" cross-entropy in the quotation.
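A numerical sanity check of this relationship (a sketch in Python; taking a standard normal as the model $M$, so $f(x) = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}x^2$, is just an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)        # the data x_1, ..., x_n
n = len(x)

def f(x):
    """Log density of the model M; here a standard normal, an arbitrary choice."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * x ** 2

cross_entropy = -np.mean(f(x))  # expression (1): -E_{F(x)}[f(X)]
neg_log_lik = -np.sum(f(x))     # expression (2): -sum_i f(x_i)

print(np.isclose(neg_log_lik, n * cross_entropy))  # True: (2) = n * (1)
```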
It takes a bit more imagination to justify the second assertion of the quotation. The connection with squared error is clear, because for a "Gaussian model" that predicts values $p(x)$ at points $x$, the value of $f$ at any such point is
$$f(x; p, \sigma) = -\frac{1}{2}\left(\log(2\pi \sigma^2) + \frac{(x-p(x))^2}{\sigma^2}\right),$$
which, up to sign, is the squared error $(x-p(x))^2$ rescaled by $1/(2\sigma^2)$ and shifted by a function of $\sigma$. One way to make the quotation correct is to assume it does not consider $\sigma$ part of the "model"--$\sigma$ must be determined somehow independently of the data. In that case differences between mean squared errors are proportional to differences between cross-entropies or log-likelihoods, thereby making all three equivalent for model fitting purposes.
(Ordinarily, though, $\sigma = \sigma(x)$ is fit as part of the modeling process, in which case the quotation would not be quite correct.)
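To see the equivalence numerically, here is a sketch (the data, the two candidate sets of predictions, and the fixed $\sigma$ are all made up for illustration). With $\sigma$ held fixed, the difference in negative log likelihood between two sets of predictions equals their difference in mean squared error times $n/(2\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 1.5                   # sigma is fixed, NOT estimated from the data
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=sigma, size=n)

def neg_log_lik(pred, y, sigma):
    """Negative Gaussian log likelihood with known sigma."""
    return 0.5 * np.sum(np.log(2 * np.pi * sigma ** 2) + (y - pred) ** 2 / sigma ** 2)

def mse(pred, y):
    return np.mean((y - pred) ** 2)

pred_a = 2.5 * x                     # two candidate sets of predictions p(x)
pred_b = 3.2 * x

# Differences in NLL are proportional to differences in MSE, factor n / (2 sigma^2)
lhs = neg_log_lik(pred_a, y, sigma) - neg_log_lik(pred_b, y, sigma)
rhs = (n / (2 * sigma ** 2)) * (mse(pred_a, y) - mse(pred_b, y))
print(np.isclose(lhs, rhs))          # True
```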
Here's my understanding of this quote. This is sort of a hand-wavy argument, but still gives some intuition. Let's consider a simple linear layer:
$$y = Wx + b$$
... or equivalently:
$$y_i = x_{1}W_{i,1} + \ldots + x_{n}W_{i,n} + b_i$$
If we focus on a single weight $W_{i,j}$, its value is determined by observing only the two variables $(x_j, y_i)$. If the training data has $N$ rows, there are only $N$ pairs $(x_j, y_i)$ from which $W_{i,j}$ has to learn its correct value. That is a lot of flexibility, and it is what the authors' remark about the weights captures.
In other words, the number of training rows $N$ must be quite large to capture the correct slope without regularization. On the other hand, $b_i$ affects only $y_i$, which means its value can be estimated more reliably from the same number of examples $N$; this is the authors' point about the biases.
In the end, we would like to regularize the parameters that have "more freedom", which is why regularizing $W_{i,j}$ makes more sense than regularizing $b_i$.
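A minimal sketch of how this plays out in code (assuming a single linear layer trained with plain NumPy gradient descent; the L2 coefficient `lam` and the other numbers are arbitrary): the penalty term involves $W$ only, so its gradient is added to the update for $W$ while the bias update is left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 200, 5, 3
X = rng.normal(size=(n, d_in))
Y = rng.normal(size=(n, d_out))

W = rng.normal(scale=0.1, size=(d_in, d_out))
b = np.zeros(d_out)
lr, lam = 0.1, 1e-2                     # learning rate and L2 strength (arbitrary)

for _ in range(500):
    pred = X @ W + b                    # row-vector convention for the linear layer
    err = pred - Y                      # gradient of 0.5 * mean squared error
    grad_W = X.T @ err / n + lam * W    # data term plus L2 penalty on W only
    grad_b = err.mean(axis=0)           # no penalty term for the bias
    W -= lr * grad_W
    b -= lr * grad_b
```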