[Math] How is the log-likelihood for a multinomial logistic regression calculated

Tags: log-likelihood, logistic regression, regression, statistics

In a multinomial logistic regression, the predicted probability $\pi_j$ of each outcome $j$ (out of a total of $J$ possible outcomes, with outcome $J$ taken as the reference category so that $A_J = 0$) is given by:

$
\pi_j = \frac{e^{A_j}}{1+\sum_{g=1}^{J-1}e^{A_g}}
$

where the value $A_j$ is predicted by a series of predictor variables. For instance, here it is predicted by two covariates ($x_1$ and $x_2$), with their associated regression slopes $\beta_1$ and $\beta_2$, and the interaction between the two covariates (with associated regression slope $\beta_{12}$):

$
A_j = \alpha_j+\beta_{1,j}x_1+\beta_{2,j}x_2+\beta_{12,j}x_1x_2
$
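
For concreteness, here is a minimal numerical sketch of these two formulas; the covariate values and coefficients below are made up purely for illustration.

```python
import numpy as np

# Hypothetical covariate values for one observation (one "sack")
x1, x2 = 0.5, -1.2

# Hypothetical fitted coefficients for the J - 1 = 2 non-reference outcomes;
# outcome 3 serves as the reference category with A_3 = 0
alpha  = np.array([0.1, -0.3])
beta1  = np.array([0.8,  0.4])
beta2  = np.array([-0.2, 0.6])
beta12 = np.array([0.05, -0.1])

# Linear predictors A_j = alpha_j + beta_{1,j} x1 + beta_{2,j} x2 + beta_{12,j} x1 x2
A = alpha + beta1 * x1 + beta2 * x2 + beta12 * x1 * x2

# pi_j = exp(A_j) / (1 + sum_g exp(A_g)); appending A_J = 0 for the reference
# outcome turns this into an ordinary softmax over [A_1, A_2, 0]
A_full = np.append(A, 0.0)
pi = np.exp(A_full) / np.exp(A_full).sum()
print(pi, pi.sum())   # three probabilities that sum to 1
```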

The model needs to be fitted to real data, and we will want to know how well it fits. For instance, perhaps out of a draw of 10 balls from a sack, 5 were red, 2 were green, and 3 were yellow. We are interested in whether the variables $x_1$ and $x_2$ associated with the sack allow us to predict this result if we plug in the values for $\beta_1$, $\beta_2$, and $\beta_{12}$ from the fitted model.

To obtain a measure of the goodness of fit of the model, we need to calculate the log-likelihood for the multinomial logistic regression, and I am unsure how to go about this. What is the formula for the log-likelihood in a multinomial logistic regression of the kind described above?

Best Answer

Let us assume that the probability of being in class $\mathcal{C}_j$, given that we observe $\boldsymbol{x}$, is given by

$$p(\mathcal{C}_j|\boldsymbol{x})=\dfrac{\exp(-\boldsymbol{w}^T_j\boldsymbol{x})}{\sum_{l=1}^{J}\exp(-\boldsymbol{w}^T_l\boldsymbol{x})}.$$
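
As a minimal sketch of this definition, the class probabilities for a single observation could be computed as follows; the weight vectors and the feature vector are made-up numbers, and the minus sign in the exponent is kept from the formula above (it merely flips the sign convention of the weights).

```python
import numpy as np

# Made-up weight vectors w_1, ..., w_J (one row per class) and one observation x
W = np.array([[ 0.4, -0.1],
              [-0.3,  0.2],
              [ 0.1,  0.5]])               # shape (J, D) with J = 3 classes, D = 2 features
x = np.array([1.0, -2.0])

scores = -W @ x                             # -w_j^T x for each class j
p = np.exp(scores) / np.exp(scores).sum()   # p(C_j | x), normalised over classes
print(p, p.sum())                           # probabilities sum to 1
```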

Suppose we observe data $\mathcal{D}=\{(\boldsymbol{x}_1,\boldsymbol{t}_1),\ldots,(\boldsymbol{x}_N,\boldsymbol{t}_N)\}$, in which each $\boldsymbol{t}$ encodes the class as a one-hot vector. For three classes, $\boldsymbol{t}=[0,0,1]^T$ signifies that the corresponding observation $\boldsymbol{x}$ belongs to the third class.

We assume that all observations are drawn independently (this allows us to multiply the individual probabilities) and from the same distribution (this allows us to use the same probability distribution for every observation). We can then rewrite the previous probability using $\boldsymbol{t}=[t_1,\ldots,t_J]^T$; note that the components $t_j$ are scalars (not boldface).

$$p(\boldsymbol{t}|\boldsymbol{x})=\prod_{j=1}^{J}\left[\dfrac{\exp(-\boldsymbol{w}^T_j\boldsymbol{x})}{\sum_{l=1}^{J}\exp(-\boldsymbol{w}^T_l\boldsymbol{x})}\right]^{t_j}.$$
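
To see why the one-hot exponents pick out the correct class, note that for the three-class example with $\boldsymbol{t}=[0,0,1]^T$ the first two factors are raised to the power $0$ and drop out, so the product reduces to the probability of the third class:

$$p(\boldsymbol{t}|\boldsymbol{x})=p(\mathcal{C}_1|\boldsymbol{x})^0\,p(\mathcal{C}_2|\boldsymbol{x})^0\,p(\mathcal{C}_3|\boldsymbol{x})^1=p(\mathcal{C}_3|\boldsymbol{x}).$$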

The likelihood of observing the data $\mathcal{D}$ is then given by

$$p(\boldsymbol{t}_1,\ldots,\boldsymbol{t}_N|\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N)=\prod_{n=1}^{N}\prod_{j=1}^{J}\left[\dfrac{\exp(-\boldsymbol{w}^T_j\boldsymbol{x}_n)}{\sum_{l=1}^{J}\exp(-\boldsymbol{w}^T_l\boldsymbol{x}_n)}\right]^{t_{nj}}.$$

Hence, the log-likelihood is given by

$$\log p(\boldsymbol{t}_1,\ldots,\boldsymbol{t}_N|\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N)=\sum_{n=1}^{N}\sum_{j=1}^{J} t_{nj} \log\left[\dfrac{\exp(-\boldsymbol{w}^T_j\boldsymbol{x}_n)}{\sum_{l=1}^{J}\exp(-\boldsymbol{w}^T_l\boldsymbol{x}_n)}\right],$$

in which $t_{nj}$ is the $j^{\text{th}}$ component of the class vector $\boldsymbol{t}_n$ for the $n^\text{th}$ observation $\boldsymbol{x}_n$.
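
A minimal numerical sketch of this formula follows; the design matrix, weights, and labels are randomly generated for illustration, and it uses the sign convention $+\boldsymbol{w}^T\boldsymbol{x}$, which is equivalent to the $-\boldsymbol{w}^T\boldsymbol{x}$ form above up to a sign flip of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, J = 10, 2, 3              # observations, features, classes (made-up sizes)
X = rng.normal(size=(N, D))     # design matrix, one row x_n per observation
W = rng.normal(size=(J, D))     # one weight vector w_j per class

# One-hot class vectors t_n (labels drawn at random for illustration)
labels = rng.integers(0, J, size=N)
T = np.eye(J)[labels]           # shape (N, J); t_{nj} = 1 iff observation n is in class j

# Log of the softmax probabilities, using the log-sum-exp trick for numerical stability
scores = X @ W.T                # entry (n, j) is w_j^T x_n
scores -= scores.max(axis=1, keepdims=True)
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

# Log-likelihood: sum over n and j of t_{nj} * log p(C_j | x_n)
log_likelihood = np.sum(T * log_probs)
print(log_likelihood)
```

Fitting the model amounts to maximizing this quantity over the weights (equivalently, minimizing its negative, the cross-entropy loss).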
