Solved – Why Negative Log Likelihood (NLL) is a measure of a model’s calibration

calibration, classification, loss-functions, machine learning, neural networks

In the context of (multiclass) classification, I've read papers which imply that NLL is minimized iff the model is well-calibrated (outputting the true probability for each class and not just a confidence), but why is that?

My intuition tells me that since NLL only takes into account the confidence of the model's predicted class $p_i$, NLL is minimized as long as $p_i$ approaches $1$.

Thus, a model can be overconfident (not well-calibrated) and still minimize NLL.

Can someone elaborate on what I am missing here?

Best Answer

Without loss of generality, let's assume binary classification. For simplicity and illustration, let's assume that there is only one feature and it takes only one value (that is, it's a constant). Since effectively there are no covariates, there is only one parameter to estimate here, the probability $p$ of the positive class. Given data, which effectively consist of only $y$ in this case, learning or training becomes identical to the problem of parameter estimation for a binomial distribution, for which any standard statistics textbook would contain a derivation like this:

Likelihood $\displaystyle L(p) = {n \choose k} p^k (1-p)^{n-k}$; take its log and set the derivative to zero, $\displaystyle \frac{\partial \log L(p)}{\partial p}=0$. Solving it gives $\hat{p} = \frac{k}{n}$.
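A minimal numerical sketch of this step (with hypothetical values for $n$ and $k$): minimizing the NLL of the constant-probability model directly recovers the closed-form MLE $\hat{p} = k/n$.

```python
# Sketch: numerically minimize the binomial NLL and compare with k/n.
import numpy as np
from scipy.optimize import minimize_scalar

n, k = 100, 37  # hypothetical sample: k positives out of n trials

def nll(p):
    # Negative log-likelihood of k successes in n trials with probability p
    # (the binomial coefficient is a constant and does not affect the argmin).
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / n)  # both approximately 0.37
```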

Now, allow $n \rightarrow \infty$, and let the true but unknown probability of the positive class be $\pi$. The likelihood becomes $\displaystyle L(p) = {n \choose n\pi} p^{n\pi} (1-p)^{n(1-\pi)}$. Repeating the same steps as above, which is legitimate despite $n \rightarrow \infty$, gives $\hat{p} = \pi$. Perfect calibration, achieved through likelihood maximization.
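To illustrate this limit, here is a small simulation sketch (with an assumed true probability $\pi = 0.2$): the NLL-minimizing constant prediction, which is the empirical frequency $k/n$ from the derivation above, converges to $\pi$ as $n$ grows.

```python
# Sketch: the NLL minimizer approaches the true class probability pi as n grows.
import numpy as np

rng = np.random.default_rng(0)
pi = 0.2  # assumed true probability of the positive class

for n in (100, 10_000, 1_000_000):
    y = rng.binomial(1, pi, size=n)
    p_hat = y.mean()  # closed-form NLL minimizer k/n
    print(n, p_hat)   # p_hat tends toward 0.2 as n increases
```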

Allowing for covariates means one has to model $p(y=1|x)$ (say, using $1/\left(1+\exp{(-(\beta_0+\beta^T x))}\right)$ as in logistic regression), which can be imperfect: the likelihood is then only maximized over a particular functional family (say, the one used in logistic regression above; this is also known as a "parametric restriction" in some contexts) and not over all possible families, hence giving potentially miscalibrated probabilities. If we allowed all possible functional families to model $p(y=1|x)$, the likelihood would be truly maximized and perfect calibration achieved, in the same way as the toy example above shows. But that would understandably require infinite data, since it amounts to a parametric model with infinitely many parameters.
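As a sketch of the covariate case (with an assumed data-generating process that is itself logistic, so the model family contains the truth): fitting logistic regression by NLL minimization then produces probabilities that match observed frequencies within prediction bins, i.e. calibrated probabilities.

```python
# Sketch: logistic regression fit by NLL minimization is calibrated when the
# true p(y=1|x) lies in the logistic family (assumed here for illustration).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=(n, 1))
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))  # true logistic p(y=1|x)
y = rng.binomial(1, p_true)

model = LogisticRegression(C=1e6).fit(x, y)  # weak regularization ~ plain MLE
p_hat = model.predict_proba(x)[:, 1]

# Reliability check: within each predicted-probability bin, the observed
# positive rate should be close to the average predicted probability.
bins = np.linspace(0, 1, 11)
idx = np.digitize(p_hat, bins) - 1
for b in range(10):
    mask = idx == b
    if mask.any():
        print(f"bin {b}: predicted {p_hat[mask].mean():.3f}, observed {y[mask].mean():.3f}")
```

If the true $p(y=1|x)$ were not logistic, the same fit could still be miscalibrated, which is exactly the "parametric restriction" caveat above.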

I think your intuition missed the fact that the likelihood depends on the true probabilities through the exponents above, hence maximizing it brings the estimated probabilities close to the true ones, as opposed to close to $1$.
