Machine Learning – How to Initialize the Final Layer to Get a Good Starting Loss

machine-learning, neural-networks, weight-initialization

In this post, Karpathy says:

verify loss @ init. Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure -log(1/n_classes) on a softmax at initialization. The same default values can be derived for L2 regression, Huber losses, etc.

I understand that having $L = -\log\frac{1}{\text{n classes}} = \log (\text{n classes})$ is the loss you get when you output the same probability for every class and the problem is balanced (i.e., the same number of examples for each class); for example, with 10 classes this is $\log 10 \approx 2.30$. However, I don't understand how you can initialize your network to output this distribution before training.

Do you know what he is referring to with "initialize your final layer correctly"? Which strategy can you follow to force your network to have an initial loss of $-\log(1/\text{n classes})$?

Best Answer

It can be done using the bias of the final layer of the network. Here I'll show how to derive it for a balanced classification problem, but the same can be done for unbalanced problems or regression problems.

Usually, in a classification network, the last layer is a linear layer followed by a softmax with $n$ outputs (one for each class). By setting the appropriate bias we can make the model predict $1/n$ for each class at initialization. This works because at initialization the weights are random with mean $\mu=0$ and some standard deviation $\sigma$, so after passing the input through the layers, the input to the last layer will have approximately $\mu=0$ as well. The output of the final layer is

$$ \hat y = \text{softmax} (W z + b) $$

where $z$ is the input to the layer, $W$ are the weights at initialization, and $b$ is the bias vector, with one component $b_k$ per class. As argued above, we expect $z$ to have $\mu = 0$, so we can control the output of the layer by tuning $b$. If we set

$$ b_k = b \quad \text{for every class } k $$

i.e., the same bias for each class, the softmax will output on average a probability of $1/n$ for each class, whatever the input of the model.
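A minimal sketch of this check, assuming PyTorch and a made-up hidden size of 128: with zero-mean inputs to a final linear layer whose bias is the same constant for every class, the softmax probabilities average to roughly $1/n$.

```python
import torch
import torch.nn as nn

n_classes = 10
final = nn.Linear(128, n_classes)     # final layer: W z + b
nn.init.constant_(final.bias, 0.0)    # b_k = b (here b = 0) for every class

z = torch.randn(4096, 128)            # stand-in for zero-mean activations z
probs = torch.softmax(final(z), dim=1)
print(probs.mean(dim=0))              # each entry is roughly 1/n = 0.1
```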

Now, given that the cross-entropy loss is

$$ \mathcal{L} = - \sum_{i=1}^n y_i \log \hat y_i $$

the expected loss at initialization is then

$$ \begin{align} \mathbb{E}[\mathcal{L}] &= -\mathbb{E}\left[\sum_{i=1}^n y_i \log \hat y_i\right]\\ &= - \sum_{i=1}^n \mathbb{E} [y_i] \log \frac{1}{n} \\ &= - \sum_{i=1}^n \frac{1}{n} \log \frac{1}{n} \\ &= -\log \frac{1}{n} = \log n \end{align} $$

where we have used that the predicted probability $\hat y_i$ is $1/n$ for every class and that, with balanced one-hot labels, $\mathbb{E}[y_i] = 1/n$.
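As a sanity check along the lines of "verify loss @ init", here is a sketch (PyTorch assumed, with a made-up toy architecture): zeroing the final-layer bias of a randomly initialized model gives an initial cross-entropy close to $\log n$.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes = 10
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, n_classes))
nn.init.zeros_(model[-1].bias)            # constant (zero) bias in the final layer

x = torch.randn(8192, 32)                 # random inputs
y = torch.randint(0, n_classes, (8192,))  # balanced random labels
loss = F.cross_entropy(model(x), y)
print(loss.item(), math.log(n_classes))   # both close to log(10) ≈ 2.30
```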

If instead of a balanced problem we have an unbalanced one, with frequencies $\{f_1, f_2, \ldots, f_n\}$ for the classes, we need to set the biases to the solution of the system of equations

$$ f_i = \frac{\exp b_i}{ \sum_j \exp b_j} $$ which has the solution (up to an additive constant, since the softmax is invariant to shifting all the logits)

$$ b_i = \log f_i $$

And the expected loss will be

$$ \mathbb{E}\left[\mathcal{L}\right] = - \sum_{i=1}^n f_i \log f_i $$
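This is just the entropy of the class distribution. A sketch of the unbalanced case (PyTorch assumed; the frequencies are made up): setting $b_i = \log f_i$ makes the initial loss match this entropy. The final-layer weights are zeroed here only so that the bias alone determines the initial prediction in the check.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

freqs = torch.tensor([0.7, 0.2, 0.1])                 # class frequencies f_i
final = nn.Linear(128, 3)
with torch.no_grad():
    final.bias.copy_(freqs.log())                     # b_i = log f_i
    final.weight.zero_()                              # isolate the effect of the bias

z = torch.randn(8192, 128)                            # zero-mean activations
y = torch.multinomial(freqs, 8192, replacement=True)  # labels drawn with frequencies f_i
loss = F.cross_entropy(final(z), y)
entropy = -(freqs * freqs.log()).sum()
print(loss.item(), entropy.item())                    # both ≈ 0.80
```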