Solved – “Multinomial Deviance” in the glmnet package

Tags: glmnet, multinomial-distribution, r

I am fitting a multinomial logistic regression using the glmnet package in R:

library(glmnet)
data(MultinomialExample)
# In recent glmnet releases (>= 4.0) the example data load as a list:
x <- MultinomialExample$x
y <- MultinomialExample$y
cvfit <- cv.glmnet(x, y, family = "multinomial", type.multinomial = "grouped")
plot(cvfit)

[Plot from plot(cvfit): cross-validated Multinomial Deviance against log(Lambda)]

What is "Multinomial Deviance" and how does it relate to "Multinomial Logloss"?

Best Answer

Deviance is a specific transformation of a likelihood ratio. In particular, we consider the model-based likelihood after some fitting has been done and compare it to the likelihood of what is called the saturated model. The latter is a model with as many parameters as data points, so it achieves a perfect fit; by looking at the likelihood ratio we are measuring, in some sense, how far our fitted model is from a "perfect" model.

In the multinomial regression case we have data of the form $(x_1, y_1), (x_2, y_2), \ldots , (x_n, y_n)$ where $y_i$ is a $k$-vector which indicates which class observation $i$ belongs to (exactly one entry contains a one and the rest are zero). Now if we fit some model that estimates a vector of probabilities $\hat{p}(x) = (\hat{p}_1(x), \hat{p}_2(x), \ldots, \hat{p}_k(x))$ then the model-based likelihood can be written

$$ \prod_{i=1}^{n} \prod_{j=1}^{k} \hat{p}_j(x_i)^{y_{ij}} . $$
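In code this is easiest to check on the log scale. Here is a minimal R sketch, where y is a hypothetical n-by-k indicator matrix and phat a matching matrix of fitted probabilities (both names are illustrative):

# Model-based log-likelihood: sum of y_ij * log(phat_ij); the zero
# entries of y contribute nothing since 0 * log(p) = 0 for p > 0.
model_loglik <- function(y, phat) {
  sum(y * log(phat))
}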

The saturated model, on the other hand, assigns probability one to each event that actually occurred, which means the fitted probability vector $\hat{p}(x_i)$ equals $y_i$ for each $i$, and we can write the ratio of the two likelihoods as

$$ \prod_{i=1}^{n} \prod_{j=1}^{k} \left ( \frac{\hat{p}_j(x_i)}{y_{ij}} \right )^{y_{ij}} . $$

To find the deviance we take minus two times the log of this quantity (a transformation that is important in mathematical statistics because of its connection with the $\chi^2$ distribution), giving

$$ -2 \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log \left ( \frac{\hat{p}_j(x_i)}{y_{ij}} \right ) . $$

(It's also worth pointing out that we treat zero times the log of anything as zero in this situation. The reason for this is that it's consistent with the idea that the saturated likelihood should equal one.)
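A direct transcription of this into R might look as follows, again with y a hypothetical indicator matrix and phat the fitted probabilities; the ifelse implements the zero-times-log convention just described:

multinomial_deviance <- function(y, phat) {
  # y_ij = 1 contributes log(phat_ij / 1) = log(phat_ij); terms with
  # y_ij = 0 are set to zero, consistent with a saturated likelihood of one.
  -2 * sum(ifelse(y == 1, log(phat), 0))
}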

The only part of this that's peculiar to glmnet is the way in which the function $\hat{p}(x)$ is estimated: it performs a constrained maximization of the likelihood, computing the deviance as the upper bound on $\| \beta \|_1$ is varied, and the model that achieves the smallest deviance on held-out data is considered "best".
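For instance, continuing from the cv.glmnet call in the question, the cross-validated deviance curve and its minimizer can be read straight off the fitted object (cvm, lambda.min, and the s argument of coef are standard parts of the cv.glmnet interface):

cvfit$cvm                      # mean cross-validated multinomial deviance, one value per lambda
cvfit$lambda.min               # the lambda achieving the smallest CV deviance
coef(cvfit, s = "lambda.min")  # coefficients of that "best" model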

Regarding the question about log loss: by keeping only the non-zero terms, the multinomial deviance above simplifies to $-2 \sum_{i=1}^{n} \log [\hat{p}_{j_i} (x_i)]$, where $j_i$ is the index of the observed class for observation $i$. This is just the empirical log loss multiplied by a constant, so minimizing the deviance is equivalent to minimizing the log loss.
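As a quick numerical check of that equivalence, here is a small R sketch on made-up data (all names are illustrative; the log loss is left as a sum rather than an average):

set.seed(1)
phat <- matrix(runif(12), nrow = 4, ncol = 3)
phat <- phat / rowSums(phat)                  # 4 observations, 3 classes
y <- diag(3)[sample(3, 4, replace = TRUE), ]  # random one-hot rows
dev <- -2 * sum(ifelse(y == 1, log(phat), 0))       # multinomial deviance
logloss <- -sum(log(phat[cbind(1:4, max.col(y))]))  # log loss of observed classes
all.equal(dev, 2 * logloss)                         # TRUE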