Implementing multiclass logistic regression from scratch

Tags: logistic regression, machine learning, multivariable-calculus, optimization, statistics

This is a sequel to a previous question about implementing binary logistic regression from scratch.

Background knowledge:

To train a logistic regression model for a classification problem with $K$ classes, we are given a training dataset consisting of feature vectors $x_1, x_2, \ldots, x_N \in \mathbb R^d$ and corresponding one-hot encoded label vectors $y_1, y_2, \ldots, y_N \in \mathbb R^K$. If example $i$ belongs to class $k$, then the label vector $y_i$ has a $1$ in the $k$th position and zeros elsewhere. Our goal is to find vectors $\beta_1, \beta_2, \ldots, \beta_K \in \mathbb R^{d+1}$ such that
$$
\tag{1} y_i \approx S\left(\begin{bmatrix} \hat x_i^T \beta_1 \\ \vdots \\ \hat x_i^T \beta_K \end{bmatrix} \right) \quad \text{for } i = 1, \ldots, N.
$$

Here $\hat x_i \in \mathbb R^{d+1}$ is the "augmented" feature vector obtained by prepending a $1$ to the feature vector $x_i \in \mathbb R^d$, and $S:\mathbb R^K \to \mathbb R^K$ is the softmax function defined by
$$
S(u) = \begin{bmatrix} \frac{e^{u_1}}{e^{u_1} + \cdots + e^{u_K}} \\ \vdots \\ \frac{e^{u_K}}{e^{u_1} + \cdots + e^{u_K}} \end{bmatrix} \quad \text{for all vectors } u = \begin{bmatrix} u_1 \\ \vdots \\ u_K \end{bmatrix} \in \mathbb R^K.
$$

The softmax function is useful in machine learning because it converts a vector into a "probability vector" (that is, a vector whose components are nonnegative and sum to $1$).
Equation (1) can be expressed more concisely using vector and matrix notation as follows. Define
$$
\beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix} \in \mathbb R^{K(d+1)} \quad \text{and} \quad M_i = \begin{bmatrix} \hat x_i^T & & & \\ & \hat x_i^T & & \\ & & \ddots & \\ & & & \hat x_i^T \end{bmatrix} \in \mathbb R^{K \times K(d+1)}.
$$

Then equation (1) says that we hope that
$$
y_i \approx S(M_i \beta) \quad \text{for } i = 1, \ldots, N.
$$

I think of $S(M_i \beta) \in \mathbb R^K$ as being a "predicted probability vector" which tells you how strongly our model believes that example $i$ belongs to each of the $K$ classes. And $y_i$ is
a "ground truth" probability vector which reflects certainty about which of the $K$ classes example $i$ belongs to.

We will use the cross-entropy loss function
$$
\ell(p,q) = \sum_{k=1}^K -p_k \log(q_k)
$$

to measure how well a predicted probability vector $q \in \mathbb R^K$ agrees with a ground truth probability vector $p \in \mathbb R^K$. Here $\log$ denotes the natural logarithm. The vector $\beta$ will be chosen to minimize the average cross-entropy
$$
\tag{2} L(\beta) = \frac{1}{N} \sum_{i=1}^N \ell(y_i, S(M_i \beta)).
$$
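
In code, the objective in equation (2) might be evaluated as follows (again a sketch with shapes of my own choosing: rows of `Y` are the one-hot $y_i$, rows of `P` are the predicted probability vectors):

```python
import numpy as np

def average_cross_entropy(Y, P, eps=1e-12):
    """Average cross-entropy between one-hot labels Y and predictions P.

    Y : (N, K) one-hot label matrix (row i is y_i)
    P : (N, K) predicted probability matrix (row i is S(M_i beta))
    """
    # Clip to avoid log(0); eps is an arbitrary small constant.
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))
```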

We can minimize $L(\beta)$ using an optimization algorithm such as gradient descent, stochastic gradient descent, or Newton's method. In order to use such methods, we need to know how to compute the gradient of $L$. For Newton's method, we also need to know how to compute the Hessian of $L$.
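
Whatever closed-form expressions we end up with, they can be sanity-checked against finite differences on a small random problem. A minimal sketch of such a check:

```python
import numpy as np

def numerical_gradient(L, beta, h=1e-6):
    """Central-difference approximation of the gradient of L at beta.

    L    : callable mapping a flat parameter vector to a scalar loss
    beta : (K*(d+1),) flat parameter vector
    """
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = h
        grad[j] = (L(beta + step) - L(beta - step)) / (2 * h)
    return grad
```

Comparing a hand-derived gradient against this on a few random values of $\beta$ is a cheap way to catch algebra or indexing mistakes.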

Question:

How can we compute the gradient and the Hessian of the average cross-entropy function $L(\beta)$ defined in equation (2)?

The goal is to compute the gradient and the Hessian of $L$ with finesse, making good use of vector and matrix notation.

(I'll post an answer below.)

Best Answer

$ \def\bbR#1{{\mathbb R}^{#1}} \def\a{\alpha}\def\b{\beta}\def\g{\gamma}\def\t{\theta} \def\l{\lambda}\def\s{\sigma}\def\e{\varepsilon} \def\n{\nabla}\def\o{{\tt1}}\def\p{\partial} \def\E{R}\def\F{{\cal F}}\def\L{{\cal L}} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\BR#1{\Big(#1\Big)} \def\vec#1{\operatorname{vec}\LR{#1}} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\hess#1#2#3{\frac{\p^2 #1}{\p #2\,\p #3}} \def\c#1{\color{red}{#1}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} $It will be convenient to use $\{\e_k,\o,J,I\}$ as the standard basis vector, the all-ones vector, all-ones matrix, and identity matrix, respectively. And to use $\{\odot,\oslash\}$ to denote elementwise multiplication and division.

Further, the indexed column vectors can be collected into matrices, e.g. $$\eqalign{ X &= \m{x_\o&\ x_2&\ldots&\ x_N} &\qiq x_i = X\e_i \\ Y &= \m{y_\o&\ y_2&\ldots&\ y_N} &\qiq y_j = Y\e_j \\ B &= \m{\b_\o&\b_2&\ldots&\b_K} &\qiq \b_k = B\e_k \\ }$$
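
In code, one possible way to lay out these matrices (a sketch with made-up sizes and random data; column $i$ of each matrix is the corresponding indexed vector):

```python
import numpy as np

# Made-up sizes: m features per (augmented) example, N examples, K classes.
m, N, K = 4, 6, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(m, N))            # column i is x_i
labels = rng.integers(0, K, size=N)    # class index of each example
Y = np.zeros((K, N))
Y[labels, np.arange(N)] = 1.0          # column i is the one-hot y_i
B = rng.normal(size=(m, K))            # column k is beta_k
```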

The various functions can be defined for the vector argument $u$. $$\eqalign{ e &= \exp(u) &\qiq de = e\odot du \\ s &= \operatorname{Softmax}(u) \\ &= \frac{e}{\o:e} = e \oslash Je &\qiq ds = \BR{\Diag{s}-ss^T}\,du \\ \l &= \operatorname{log-sum-exp}(u) \\ &= \log(\o:e) &\qiq d\l = s:du \\ }$$ $$\eqalign{ \log(s) &= \log(e)-\log(Je) \;&=\; u -\l\o \qquad\qquad\qquad\qquad\qquad\qquad \\ \ell(y,s) &= -y:\log(s) \;&=\; \l{y:\o} - y:u \;=\; \l - y:u \\ }$$ where a colon denotes the trace/Frobenius product, i.e. $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \big\|A\big\|^2_F \\ }$$ The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many different ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:AB &= CB^T:A = A^TC:B \\ }$$ Furthermore, it commutes with elementwise multiplication and division $$\eqalign{ A:(B\odot C) &= (A\odot B):C &= \sum_{i=1}^m\sum_{j=1}^n A_{ij} B_{ij} C_{ij} \\ }$$
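
These differential identities are easy to spot-check numerically. A small sketch (sizes, seed, and tolerances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5
u = rng.normal(size=K)
du = 1e-6 * rng.normal(size=K)       # a small perturbation

def softmax(u):
    e = np.exp(u - u.max())          # shift for stability; the softmax value is unchanged
    return e / e.sum()

s = softmax(u)
J_s = np.diag(s) - np.outer(s, s)    # claimed Jacobian of the softmax at u

# First-order check: softmax(u + du) - softmax(u) ~ (Diag(s) - s s^T) du
print(np.allclose(softmax(u + du) - s, J_s @ du, atol=1e-10))

# log-sum-exp check: d(lse) ~ s : du
lse = lambda v: np.log(np.exp(v).sum())
print(np.isclose(lse(u + du) - lse(u), s @ du, atol=1e-10))
```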

The Kronecker product $(\otimes)$ can also be used to good effect since $$\eqalign{ M_i &= \LR{I\otimes x_i^T} \\ M_i\b &= \LR{I\otimes x_i^T} \vec{B} = \vec{\e_i^TX^TB} = {B^TX\e_i} \\ y_i &\approx \operatorname{Softmax}(M_i\b) = \exp\LR{B^TX\e_i}\oslash J\exp\LR{B^TX\e_i} \\ Y &\approx \exp\LR{B^TX}\oslash J\exp\LR{B^TX} \;\doteq\; S \\ }$$ The last equation is the Matrix Model that must be optimized for $B$ with respect to the cross-entropy loss function. Note that we now have matrix analogs of all of the vector quantities $$\eqalign{ u &\to U=B^TX \\ e &\to E=\exp(U) \\ s &\to S=\operatorname{Softmax}(U) &= E\oslash JE \\ &&= E\cdot\c{\Diag{E^T\o}^{-1}} \\ &&= E\c{D^{-1}} \\ y &\to Y,\quad Y^T\o=\o,\;&JY=J,\quad Y:J=N \\ }$$ Calculate the gradient of the loss function. $$\eqalign{ {\ell} &= -\BR{Y:\log(S)} \\ &= Y:\log(JE) - Y:U \\ \\ d{\ell} &= Y:d\log(JE) - Y:dU \\ &= Y:(J\,dE\oslash JE) - Y:dB^TX \\ &= (Y\oslash JE):(J\,dE) - Y^T:X^TdB \\ &= \LR{Y{D^{-1}}}:\BR{J\,dE} - XY^T:dB \\ &= JY{D^{-1}}:dE - XY^T:dB \\ &= JD^{-1}:(E\odot dU) - XY^T:dB \\ &= (E\odot JD^{-1}):dB^TX - XY^T:dB \\ &= \LR{ED^{-1}}^T:X^TdB - XY^T:dB \\ &= X\LR{ED^{-1}}^T:dB - XY^T:dB \\ \\ \grad{\ell}{B} &= X\LR{ED^{-1} - Y}^T \\ &= X\LR{S - Y}^T \;\doteq\; G \\ }$$ This $G$ matrix can be used in any standard gradient descent algorithm.
The Hessian will be a fourth-order tensor and much uglier.
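
Here is a minimal NumPy sketch of this matrix form of the gradient, wired into plain gradient descent (the data, sizes, step size, and iteration count are all made up; columns of `X`, `Y`, `B` follow the conventions above):

```python
import numpy as np

m, N, K = 4, 50, 3
rng = np.random.default_rng(2)
X = rng.normal(size=(m, N))                  # column i is x_i
labels = rng.integers(0, K, size=N)
Y = np.zeros((K, N))
Y[labels, np.arange(N)] = 1.0                # column i is the one-hot y_i
B = np.zeros((m, K))                         # column k is beta_k

def loss_and_grad(B):
    U = B.T @ X                              # K x N score matrix
    E = np.exp(U - U.max(axis=0))            # per-column shift for stability; S below is unchanged
    S = E / E.sum(axis=0)                    # S = E D^{-1}: each column sums to 1
    loss = -np.sum(Y * np.log(S))            # ell = -Y : log(S)
    G = X @ (S - Y).T                        # G = X (S - Y)^T, same shape as B
    return loss, G

# Plain gradient descent on B; step size and iteration count are arbitrary.
step = 0.5 / N
for _ in range(500):
    loss, G = loss_and_grad(B)
    B -= step * G
print(loss_and_grad(B)[0])                   # loss should have decreased from its initial value
```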

Update

Here is an attempt at a Hessian calculation (with flattening via vectorization).

First, recall some identities involving the Diag() operator $$\eqalign{ &E\odot JP = E\cdot\Diag{P^T\o} \\ &H = \Diag{h} \iff h = H\o \\ &\Diag{H\o} = \Diag{h} = H \qiq &\Diag{H^n\o} = H^n \\ }$$ Calculate the differential of $G$. $$\eqalign{ G &= X\LR{ED^{-1}-Y}^T \\ dG &= X\LR{dE\;D^{-1}}^T - X\LR{E\,dD\,D^{-2}}^T \\ &= XD^{-1}\,dE^T - XD^{-2}\,dD\,E^T \\ &= XD^{-1}\LR{E^T\odot dU^T} - XD^{-2}\Diag{dE^T\o}\,E^T \\ &= XD^{-1}\LR{\c{E^T\odot X^TdB}} - XD^{-2}\Diag{(\c{E^T\odot X^TdB})\o}\,E^T \\ }$$ Note the following vec expansion of the $\c{\rm red}$ term $$\eqalign{ &\E = \Diag{\vec{E^T}} \\ &\vec{\c{E^T\odot X^TdB}} = \E\cdot\vec{X^TdB} = \E \LR{I\otimes X^T} d\b \\ }$$ and use this to vectorize the differential relationship and recover the Hessian. $$\eqalign{ dg &= \vec{dG} \\ &= \LR{I\otimes XD^{-1}} \vec{\c{E^T\odot X^TdB}} - \LR{E\boxtimes XD^{-2}} \LR{\c{E^T\odot X^TdB}}\o \\ &= \BR{\LR{I\otimes XD^{-1}} - \o^T\otimes\LR{E\boxtimes XD^{-2}}} \;\vec{\c{E^T\odot X^TdB}} \\ &= \BR{\LR{I\otimes XD^{-1}} - \o^T\otimes\LR{E\boxtimes XD^{-2}}}\,\E\LR{I\otimes X^T} d\b \\ \grad{g}{\b} &= \BR{\LR{I\otimes XD^{-1}} - \o^T\otimes\LR{E\boxtimes XD^{-2}}}\,\E\LR{I\otimes X^T} \;=\; \hess{\ell}{\b\,}{\b^T} \\ }$$ where $\{\boxtimes\}$ denotes the Khatri-Rao product, which can be expanded in terms of Hadamard and Kronecker products $$\eqalign{ B\boxtimes A &= (B\otimes\o_m)\odot(\o_p\otimes A) \;\in \bbR{(pm)\times n} \\ \o_p &\in \bbR{p\times\o} \qquad A \in \bbR{m\times n} \qquad B \in \bbR{p\times n} \\\\ }$$ Note that the gradient can also be vectorized, if that is preferable $$\vec{G} \;=\; g \;=\; \grad{\ell}{\b}$$
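
Here is a NumPy sketch of this vectorized Hessian on made-up data (sizes, seed, and variable names are chosen only for illustration), cross-checked against the blockwise form $\sum_i\big(\operatorname{Diag}(s_i)-s_is_i^T\big)\otimes x_ix_i^T$ obtained by differentiating one example at a time:

```python
import numpy as np

# Made-up data; columns of X and B follow the conventions above.
m, N, K = 3, 8, 4
rng = np.random.default_rng(3)
X = rng.normal(size=(m, N))
B = rng.normal(size=(m, K))

U = B.T @ X
E = np.exp(U)                              # kept literal (no stability shift) to match E and D above
d = E.sum(axis=0)                          # diagonal of D = Diag(E^T 1)
S = E / d                                  # S = E D^{-1}

# Khatri-Rao product E ⊠ (X D^{-2}): column i is E[:, i] kron (X D^{-2})[:, i].
XD2 = X / d**2
KR = np.einsum('kn,mn->kmn', E, XD2).reshape(K * m, N)

R = np.diag(E.reshape(-1))                 # Diag(vec(E^T)): rows of E are the columns of E^T
I_K = np.eye(K)
H = (np.kron(I_K, X / d) - np.kron(np.ones((1, K)), KR)) @ R @ np.kron(I_K, X.T)

# Cross-check against the per-example sum  sum_i (Diag(s_i) - s_i s_i^T) kron (x_i x_i^T).
H_ref = sum(np.kron(np.diag(S[:, i]) - np.outer(S[:, i], S[:, i]),
                    np.outer(X[:, i], X[:, i])) for i in range(N))
print(np.allclose(H, H_ref), np.allclose(H, H.T))
```

With a Hessian in hand, a damped Newton step of the form $\b \leftarrow \b - (H+\mu I)^{-1}g$ is natural; some regularization is advisable because the multiclass parameterization is redundant and $H$ is only positive semidefinite.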