Solved – Derivation of Hessian for multinomial logistic regression in Böhning (1992)

derivative, hessian, linear algebra, logistic, notation

This question is basically about row/column notation of derivatives and some basic rules. However, I couldn't figure out where I'm wrong.

For multinomial logistic regression, I'm trying to get the following observed information given in Böhning's paper [1]:
$$
-\nabla^2L=\begin{pmatrix}
p_1(1-p_1)\mathbf{xx}^T & -p_1p_2\mathbf{xx}^T & \dots & -p_1p_k\mathbf{xx}^T \\ \vdots & p_2(1-p_2)\mathbf{xx}^T & & \vdots \\
-p_kp_1\mathbf{xx}^T & \dots & & p_k(1-p_k)\mathbf{xx}^T
\end{pmatrix}
$$

where $p_i=p(y_i=1|\mathbf{x})=\frac{\exp(\omega_i^T\mathbf{x})}{\sum_{j=1}^K\exp(\omega_j^T\mathbf{x})}$ for $i=1,\dots,K$, $\mathbf{x}$ is a single observation (a column vector of size $d$), and $\omega_i$ is the parameter vector of the $i$th class (a column vector of size $d$). $y_i \in \lbrace 0, 1 \rbrace$ is the class label.

I can write the log-likelihood (for a single observation) as follows:
$$
\mathcal{L}=\sum_{k=1}^K y_{k}\omega_k^T\mathbf{x}-\ln\sum_{j=1}^K\exp{(\omega_j^T\mathbf{x})}
$$

According to Matrix Cookbook, $\frac{\partial \omega_k^T \mathbf{x}}{\partial \omega_k}=\mathbf{x}$. By using this, I wrote the following first and second derivatives:

$$
\frac{\partial \mathcal{L}}{\partial \omega_k}=\left( y_{k} -\frac{\exp(\omega_k^T\mathbf{x})}{\sum_{j=1}^K\exp{(\omega_j^T\mathbf{x})}} \right)\mathbf{x}
\\
\frac{\partial^2 \mathcal{L}}{\partial \omega_k \partial \omega_k}=-\left( \frac{\exp(\omega_k^T\mathbf{x})}{\sum_{j=1}^K\exp{(\omega_j^T\mathbf{x})}}-\frac{(\exp(\omega_k^T\mathbf{x}))^2}{(\sum_{j=1}^K\exp{(\omega_j^T\mathbf{x})})^2} \right)\mathbf{x} \mathbf{x}
$$

So instead of $\mathbf{xx}^T$, I get $\mathbf{xx}$. How can I correct this?

In some sources, the second derivative is defined as $\frac{\partial^2\mathcal{L}}{\partial\omega_k\partial\omega_k^T}$. Is this the correct notation? (Even if it is, I still can't correct my derivatives.)

[1] D. Böhning, "Multinomial Logistic Regression Algorithm," Ann. Inst. Statist. Math., vol. 44, no. 1, pp. 197–200, 1992.

Best Answer

In my view, the issue comes from a confusion about dimensionality, compounded by the fact that this Hessian departs from the usual setting in being partitioned into sub-blocks. Other than a mild bit of calculus, this is routine linear algebra, with devices like the Kronecker delta and the Kronecker product used to compress the notation.

There may well be a quicker and cleaner solution using a suitable selection of recipes from the Matrix Cookbook, or Einstein notation. But I've generally found that unless it's immediately obvious which recipe to apply, there is no harm in building the condensed abstractions piece by piece from the ground up, particularly when you are in doubt.

For this reason, I will derive the relevant partial derivatives, and show how these can be stacked into the data-structures (gradient and Hessian) whose expressions you are seeking to validate.

Notation and dimensionality.

$\mathbf{x} \in \mathbb{R}^D$ is an observation vector.

$\mathbf{y} \in \mathbb{R}^K$ is the vector of 0/1 class indicators for that observation.

$\omega^{(k)} \in \mathbb{R}^D$ for $k = 1, \dots, K$ are parameter vectors, where

$$\omega^{(k)} = \begin{bmatrix} \omega_1^{(k)} \\ \vdots \\ \omega_D^{(k)} \end{bmatrix}$$

$\omega \in \mathbb{R}^{DK}$ is a $DK$-dimensional column vector, consisting of all of the column parameter vectors $\omega^{(1)}, \dots, \omega^{(K)}$ stacked vertically on top of one another. For each element $\omega_d^{(k)}$, the superscript $k$ indexes the class and the subscript $d$ indexes the covariate element.

The log-likelihood $\mathcal{L}(\omega)$ is a scalar function, in that $\mathcal{L} : \mathbb{R}^{DK} \rightarrow \mathbb{R}$. For a single observation $\mathbf{y}$, the log-likelihood is

$$\mathcal{L}(\omega) = \left( \sum^K_{k=1} y_k {\omega^{(k)}}^T \mathbf{x} \right) - \ln \left(1 + \sum^K_{j=1} \exp( {\omega^{(j)}}^T \mathbf{x}) \right).$$
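
For concreteness, the steps below can be checked numerically. Here is a minimal NumPy sketch of this log-likelihood (the function name `log_lik` and the layout, storing each $\omega^{(k)}$ as a row of a $K \times D$ array, are my own choices, not the paper's); the `1 +` in the denominator corresponds to this answer's convention of a reference category whose parameter vector is fixed at zero.

```python
import numpy as np

def log_lik(w, x, y):
    """Single-observation log-likelihood L(omega).

    w : (K, D) array; row k is the parameter vector omega^(k)
    x : (D,)   observation vector
    y : (K,)   0/1 class-indicator vector
    """
    eta = w @ x                                   # the K linear predictors omega^(k)^T x
    return y @ eta - np.log1p(np.exp(eta).sum())  # sum_k y_k eta_k - ln(1 + sum_j exp(eta_j))
```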

Computing the score vector/gradient $\nabla_{\omega} \mathcal{L}(\omega)$.

Note that the gradient $\nabla_{\omega} \mathcal{L}(\omega) \in \mathbb{R}^{DK}$ is a $DK$-dimensional column vector.

Each element is the partial derivative of $\mathcal{L}$ with respect to $\omega_d^{(k)}$. Using dummy indices $c$ and $l$ to avoid confusion, we have that

\begin{align} \frac{\partial \mathcal{L}}{\partial \omega_c^{(l)}} &= \frac{\partial}{\partial \omega_c^{(l)}} \left\{ \left( \sum^K_{k=1} y_k {\omega^{(k)}}^T \mathbf{x} \right) - \ln \left[1 + \sum^K_{j=1} \exp( {\omega^{(j)}}^T \mathbf{x}) \right] \right\} \\ &= y_l \cdot \frac{\partial}{\partial \omega_c^{(l)}} {\omega^{(l)}}^T \mathbf{x} - \frac{\exp( {\omega^{(l)}}^T \mathbf{x})}{1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x})} \cdot \frac{\partial}{\partial \omega_c^{(l)}} {\omega^{(l)}}^T \mathbf{x} \\ &= y_l x_c - \frac{\exp( {\omega^{(l)}}^T \mathbf{x})}{1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x})} x_c \\ &= \left[y_l - \frac{\exp( {\omega^{(l)}}^T \mathbf{x})}{1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x})} \right] x_c \\ &= (y_l - \hat{p}_l) x_c, \end{align}

where

$$\hat{p}_l = \frac{\exp( {\omega^{(l)}}^T \mathbf{x})}{1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x})}.$$

Writing out the gradient, we have that:

$$\begin{align} \nabla_{\omega} \mathcal{L}(\omega) = \begin{bmatrix} (y_1 - \hat{p}_1)x_1 \\ \vdots \\ (y_1 - \hat{p}_1) x_D \\ (y_2 - \hat{p}_2) x_1 \\ \vdots \\ (y_2 - \hat{p}_2) x_D \\ \vdots \\ (y_K - \hat{p}_K) x_1 \\ \vdots \\ (y_K - \hat{p}_K) x_D \\ \end{bmatrix} = \begin{bmatrix} (\mathbf{y} - \hat{\mathbf{p}})_1 \mathbf{x} \\ (\mathbf{y} - \hat{\mathbf{p}})_2 \mathbf{x} \\ \vdots \\ (\mathbf{y} - \hat{\mathbf{p}})_K \mathbf{x} \\ \end{bmatrix} = (\mathbf{y} - \hat{\mathbf{p}}) \otimes \mathbf{x}. \end{align}$$

Where $(\mathbf{y} - \hat{\mathbf{p}}) \otimes \mathbf{x}$ is the Kronecker product of the "matrices" $(\mathbf{y} - \hat{\mathbf{p}}) \in \mathbb{R}^{K \times 1}$ and $\mathbf{x} \in \mathbb{R}^{D \times 1}$, resulting in a $DK$-dimensional column vector.
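
This Kronecker-product form is easy to sanity-check. The sketch below (purely illustrative: random data, arbitrary sizes $K = 3$, $D = 4$, and a flattened parameter vector ordered class-by-class as described above) compares $(\mathbf{y} - \hat{\mathbf{p}}) \otimes \mathbf{x}$ against a central finite-difference approximation of the gradient.

```python
import numpy as np

def log_lik(w_flat, x, y, K, D):
    w = w_flat.reshape(K, D)                      # unstack omega into K rows of length D
    eta = w @ x
    return y @ eta - np.log1p(np.exp(eta).sum())

rng = np.random.default_rng(0)
K, D = 3, 4                                       # illustrative sizes
x = rng.normal(size=D)
y = np.zeros(K)
y[1] = 1.0                                        # one-hot label
w = rng.normal(size=K * D)                        # stacked parameter vector omega

eta = w.reshape(K, D) @ x
p_hat = np.exp(eta) / (1.0 + np.exp(eta).sum())

grad_kron = np.kron(y - p_hat, x)                 # (y - p_hat) (x) x, a DK-vector

eps = 1e-6                                        # central finite differences as an independent check
grad_fd = np.array([(log_lik(w + eps * e, x, y, K, D) - log_lik(w - eps * e, x, y, K, D)) / (2 * eps)
                    for e in np.eye(K * D)])
print(np.allclose(grad_kron, grad_fd))            # True
```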

Computing the observed information matrix/negative Hessian $-\nabla^2 \mathcal{L}(\omega)$.

Because the gradient $\nabla_{\omega} \mathcal{L}(\omega) \in \mathbb{R}^{DK}$, the negative Hessian $-\nabla^2 \mathcal{L}(\omega) \in \mathbb{R}^{DK \times DK}.$

Each element of the Hessian, which consists of $D^2K^2$ elements in total, is given by

\begin{align} -\frac{\partial^2 \mathcal{L}}{\partial \omega_{c'}^{(l')} \partial \omega_c^{(l)}} &= -\frac{\partial}{\partial \omega_{c'}^{(l')}} \frac{\partial \mathcal{L}}{\partial \omega_c^{(l)}} \\ &= -\frac{\partial}{\partial \omega_{c'}^{(l')}} \left[y_l x_c - \frac{\exp( {\omega^{(l)}}^T \mathbf{x})}{1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x})} x_c \right].\\ \end{align}

The left-hand term in the square brackets vanishes, and so the above simplifies to

$$-\frac{\partial^2 \mathcal{L}}{\partial \omega_{c'}^{(l')} \partial \omega_c^{(l)}} = \frac{\partial}{\partial \omega_{c'}^{(l')}} \left[\frac{\exp( {\omega^{(l)}}^T \mathbf{x})}{1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x})} x_c \right].$$

For a function $f = g / h$, the quotient rule gives $f' = (g'h - gh') / h^2$, where $f'$, $g'$, and $h'$ represent partial derivatives.

Now, in the case where $l' \neq l$, the numerator $g = \exp({\omega^{(l)}}^T \mathbf{x})\, x_c$ does not depend on $\omega^{(l')}$, so its partial derivative $g' = 0$, giving

\begin{align} -\frac{\partial^2 \mathcal{L}}{\partial \omega_{c'}^{(l')} \partial \omega_c^{(l)}} &= \frac{-\exp({\omega^{(l)}}^T \mathbf{x}) x_c \cdot \exp({\omega^{(l')}}^T \mathbf{x}) x_{c'}}{\left[1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x}) \right]^2}. \end{align}

When $l' = l$, we have that

\begin{align} -\frac{\partial^2 \mathcal{L}}{\partial \omega_{c'}^{(l')} \partial \omega_c^{(l)}} &= \frac{\exp({\omega^{(l)}} ^T \mathbf{x}) x_c x_{c'} \cdot [1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x}) ]- \exp({\omega^{(l)}}^T \mathbf{x})x_c \cdot \exp({\omega^{(l')}}^T \mathbf{x})x_{c'}}{\left[1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x}) \right]^2} \\ &= \left( \frac{\exp( {\omega^{(l)}}^T \mathbf{x})}{1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x})} - \frac{\exp({\omega^{(l)}}^T \mathbf{x}) \cdot \exp({\omega^{(l')}}^T \mathbf{x})}{\left[1 + \sum^K_{j=1} \exp({\omega^{(j)}}^T \mathbf{x}) \right]^2} \right) x_c x_{c'}. \\ \end{align}

The author compresses the cases where $l' \neq l$ and $l' = l$ using the Kronecker delta function, meaning that both cases above can be written generally as

$$-\frac{\partial^2 \mathcal{L}}{\partial \omega_{c'}^{(l')} \partial \omega_c^{(l)}} = (\delta(l, l') \hat{p}_l - \hat{p}_l \hat{p}_{l'}) x_c x_{c'}.$$
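
Transcribed directly into code, this element-wise formula is just a quadruple loop over $(l, l', c, c')$; the sketch below (same illustrative random setup as the earlier snippet) builds the $DK \times DK$ observed information one scalar at a time, and serves as a reference for the compressed Kronecker form derived next.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                                       # illustrative sizes
x = rng.normal(size=D)
w = rng.normal(size=(K, D))                       # row k is omega^(k)

eta = w @ x
p_hat = np.exp(eta) / (1.0 + np.exp(eta).sum())

# (delta(l, l') p_l - p_l p_l') x_c x_c', stored at row l*D + c, column l'*D + c'
neg_hess = np.empty((K * D, K * D))
for l in range(K):
    for lp in range(K):
        for c in range(D):
            for cp in range(D):
                neg_hess[l * D + c, lp * D + cp] = (
                    ((l == lp) * p_hat[l] - p_hat[l] * p_hat[lp]) * x[c] * x[cp]
                )
```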

With a view to stacking these $D^2K^2$ elements into a $DK \times DK$ Hessian $- \nabla^2 \mathcal{L}$, note that the outer product $\mathbf{x} \mathbf{x}^T$ is a $D \times D$ matrix:

$$\mathbf{x} \mathbf{x}^T = \begin{bmatrix} x_1 \\ \vdots \\ x_D \\ \end{bmatrix} \begin{bmatrix} x_1 \cdots x_D \end{bmatrix} = \begin{bmatrix} x_1^2 & x_1 x_2 &\cdots &x_1 x_D \\ x_2 x_1 & x_2^2 &\cdots &x_2 x_D \\ \vdots & \vdots &\ddots & \vdots \\ x_D x_1 & x_D x_2 &\cdots &x_D^2 \\ \end{bmatrix}$$

And also note that if we cycle through $l = 1, \dots, K$ and $l' = 1, \dots, K$, storing each element $(\delta(l, l') \hat{p}_l - \hat{p}_l \hat{p}_{l'})$ into a $K \times K$ matrix, which we can refer to as $\Lambda_{\hat{\mathbf{p}}} - \hat{\mathbf{p}} \hat{\mathbf{p}}^T$ with $\Lambda_{\hat{\mathbf{p}}} = \operatorname{diag}(\hat{\mathbf{p}})$, then:

$$\Lambda_{\hat{\mathbf{p}}} - \hat{\mathbf{p}} \hat{\mathbf{p}}^T = \begin{bmatrix} \hat{p}_1( 1 - \hat{p}_1) & -\hat{p}_1 \hat{p}_2 &\cdots &-\hat{p}_1 \hat{p}_K \\ -\hat{p}_2 \hat{p}_1 & \hat{p}_2 (1 - \hat{p}_2) &\cdots & -\hat{p}_2 \hat{p}_K \\ \vdots & \vdots &\ddots & \vdots \\ -\hat{p}_K \hat{p}_1 & -\hat{p}_K \hat{p}_2 &\cdots & \hat{p}_K (1 -\hat{p}_K) \\ \end{bmatrix}$$
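
In code these two building blocks are a single outer product and a diagonal-minus-outer-product; a tiny sketch with made-up numbers (the reference class absorbs the remaining probability mass):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])                     # D = 3
p_hat = np.array([0.2, 0.3, 0.4])                 # K = 3; the reference class keeps the remaining 0.1

xxT = np.outer(x, x)                              # x x^T, a D x D matrix (the outer product the question was missing)
KxK = np.diag(p_hat) - np.outer(p_hat, p_hat)     # Lambda_p - p p^T: p_l(1 - p_l) on the diagonal, -p_l p_l' off it
print(xxT.shape, KxK.shape)                       # (3, 3) (3, 3)
```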

We can now think of populating the $DK \times DK$ Hessian matrix one $D \times D$ sub-matrix at a time. Indexing a total of $K^2$ individual $D \times D$ sub-matrices with a coordinate $(i, j)$, where $i = 1, \dots, K$ and $j = 1, \dots, K$, the $(i,j)$-th $D \times D$ sub-matrix is

$$(\Lambda_{\hat{\mathbf{p}}} - \hat{\mathbf{p}} \hat{\mathbf{p}}^T)_{ij} \mathbf{x} \mathbf{x}^T.$$

Populating our Hessian with a total of $K^2$ sub-matrices, and using the definition of the Kronecker product to compress the notation, we have that:

\begin{align} - \nabla^2 \mathcal{L} &= \begin{bmatrix} \hat{p}_1( 1 - \hat{p}_1) \mathbf{x} \mathbf{x}^T & -\hat{p}_1 \hat{p}_2 \mathbf{x} \mathbf{x}^T &\cdots &-\hat{p}_1 \hat{p}_K \mathbf{x} \mathbf{x}^T \\ -\hat{p}_2 \hat{p}_1 \mathbf{x} \mathbf{x}^T & \hat{p}_2 (1 - \hat{p}_2) \mathbf{x} \mathbf{x}^T &\cdots & -\hat{p}_2 \hat{p}_K \mathbf{x} \mathbf{x}^T \\ \vdots & \vdots &\ddots & \vdots \\ -\hat{p}_K \hat{p}_1 \mathbf{x} \mathbf{x}^T & -\hat{p}_K \hat{p}_2 \mathbf{x} \mathbf{x}^T &\cdots & \hat{p}_K (1 -\hat{p}_K) \mathbf{x} \mathbf{x}^T \\ \end{bmatrix} = (\Lambda_{\hat{\mathbf{p}}} - \hat{\mathbf{p}} \hat{\mathbf{p}}^T) \otimes \mathbf{x} \mathbf{x}^T.\\ \end{align}

Where the final equality is a Kronecker product of the matrices $(\Lambda_{\hat{\mathbf{p}}} - \hat{\mathbf{p}} \hat{\mathbf{p}}^T) \in \mathbb{R}^{K \times K}$ and $\mathbf{x}\mathbf{x}^T \in \mathbb{R}^{D \times D}.$
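
As a final check (again with illustrative random data), `np.kron` applied to the two building blocks reproduces the block-wise construction of $-\nabla^2 \mathcal{L}$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4
x = rng.normal(size=D)
w = rng.normal(size=(K, D))

eta = w @ x
p_hat = np.exp(eta) / (1.0 + np.exp(eta).sum())

# Kronecker-product form: (Lambda_p - p p^T) (x) (x x^T)
neg_hess_kron = np.kron(np.diag(p_hat) - np.outer(p_hat, p_hat), np.outer(x, x))

# block-wise construction: the (l, l')-th D x D block is (delta(l, l') p_l - p_l p_l') x x^T
neg_hess_blocks = np.empty((K * D, K * D))
for l in range(K):
    for lp in range(K):
        block = ((l == lp) * p_hat[l] - p_hat[l] * p_hat[lp]) * np.outer(x, x)
        neg_hess_blocks[l * D:(l + 1) * D, lp * D:(lp + 1) * D] = block

print(np.allclose(neg_hess_kron, neg_hess_blocks))   # True
```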