Solved – Hessian of logistic function


I am having difficulty deriving the Hessian of the objective function $l(\theta)$ in logistic regression, where $l(\theta)$ is:
$$
l(\theta)=\sum_{i=1}^{m} \left[y_{i} \log(h_\theta(x_{i})) + (1- y_{i}) \log (1 - h_\theta(x_{i}))\right]
$$

$h_\theta(x)$ is the logistic function. The Hessian is $X^T D X$. I tried to derive it by calculating the componentwise second derivatives $\frac{\partial^2 l(\theta)}{\partial \theta_i \partial \theta_j}$, but it wasn't obvious to me how to get from those to the matrix notation.

Does anybody know any clean and easy way of deriving $X^T D X$?

Best Answer

I derive all the necessary properties and identities here so that the solution is self-contained, but apart from that the derivation is clean and easy. Let us formalize our notation and write the loss function a little more compactly. Consider $m$ samples $\{x_i,y_i\}$ such that $x_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}$. Recall that in binary logistic regression the hypothesis function $h_\theta$ is typically the logistic function. Formally

$$h_\theta(x_i)=\sigma(\omega^Tx_i)=\sigma(z_i)=\frac{1}{1+e^{-z_i}},$$

where $\omega\in\mathbb{R}^d$ and $z_i=\omega^Tx_i$. The loss function (which I believe is missing a negative sign in the OP's version) is then defined as:

$$l(\omega)=\sum_{i=1}^m -\Big( y_i\log\sigma(z_i)+(1-y_i)\log(1-\sigma(z_i))\Big)$$
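
As a quick sanity check, here is a minimal NumPy sketch of this loss; the function and variable names, and the convention that `X` holds one sample per column, are my own choices, not anything from the question:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Negative log-likelihood l(w); X is d x m with one sample x_i per column."""
    z = X.T @ w                      # z_i = w^T x_i for every sample
    s = sigmoid(z)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))
```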

There are two important properties of the logistic function which I derive here for future reference. First, note that $1-\sigma(z)=1-1/(1+e^{-z})=e^{-z}/(1+e^{-z})=1/(1+e^z)=\sigma(-z)$.

Also note that

\begin{equation} \begin{aligned} \frac{\partial}{\partial z}\sigma(z)=\frac{\partial}{\partial z}(1+e^{-z})^{-1}=e^{-z}(1+e^{-z})^{-2}&=\frac{1}{1+e^{-z}}\frac{e^{-z}}{1+e^{-z}} =\sigma(z)(1-\sigma(z)) \end{aligned} \end{equation}
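
Both identities are easy to verify numerically; a small sketch (the grid of test points and the step size `h` are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)   # arbitrary test points
h = 1e-6                     # finite-difference step

# First identity: 1 - sigma(z) == sigma(-z)
assert np.allclose(1 - sigmoid(z), sigmoid(-z))

# Second identity: d/dz sigma(z) == sigma(z)(1 - sigma(z)), via central differences
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(numeric, sigmoid(z) * (1 - sigmoid(z)))
```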

Instead of taking derivatives with respect to components, here we will work directly with vectors (any matrix-calculus reference covers the identities below). The Hessian of the loss function $l(\omega)$ is given by $\vec{\nabla}^2l(\omega)$, but first recall that $\frac{\partial z}{\partial \omega} = \frac{\partial(\omega^Tx)}{\partial \omega}=x^T$ and $\frac{\partial z}{\partial \omega^T}=\frac{\partial (\omega^Tx)}{\partial \omega ^T} = x$.

Let $l_i(\omega)=-y_i\log\sigma(z_i)-(1-y_i)\log(1-\sigma(z_i))$. Using the properties we derived above and the chain rule

\begin{equation} \begin{aligned} \frac{\partial \log\sigma(z_i)}{\partial \omega^T} &= \frac{1}{\sigma(z_i)}\frac{\partial\sigma(z_i)}{\partial \omega^T} = \frac{1}{\sigma(z_i)}\frac{\partial\sigma(z_i)}{\partial z_i}\frac{\partial z_i}{\partial \omega^T}=(1-\sigma(z_i))x_i\\ \frac{\partial \log(1-\sigma(z_i))}{\partial \omega^T}&= \frac{1}{1-\sigma(z_i)}\frac{\partial(1-\sigma(z_i))}{\partial \omega^T} =-\sigma(z_i)x_i \end{aligned} \end{equation}

It's now trivial to show that

$$\vec{\nabla}l_i(\omega)=\frac{\partial l_i(\omega)}{\partial \omega^T} =-y_ix_i(1-\sigma(z_i))+(1-y_i)x_i\sigma(z_i)=x_i(\sigma(z_i)-y_i)$$

whew!
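
If you want to double-check this expression, you can compare it against a finite-difference gradient of the loss; a minimal sketch, assuming random data (the dimensions, seed, and step size are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 50
X = rng.normal(size=(d, m))          # one sample x_i per column
y = rng.integers(0, 2, size=m)
w = rng.normal(size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    s = sigmoid(X.T @ w)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

# Analytic gradient: sum_i x_i (sigma(z_i) - y_i) = X (sigma(z) - y)
grad = X @ (sigmoid(X.T @ w) - y)

# Finite-difference gradient, one coordinate at a time
h = 1e-6
numeric = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h)
                    for e in np.eye(d)])
assert np.allclose(grad, numeric, atol=1e-4)
```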

Our last step is to compute the Hessian, which we obtain by differentiating the gradient once more, this time with respect to $\omega$:

$$\vec{\nabla}^2l_i(\omega)=\frac{\partial^2 l_i(\omega)}{\partial \omega\partial \omega^T}=\frac{\partial}{\partial \omega}\Big(x_i(\sigma(z_i)-y_i)\Big)=x_i\frac{\partial \sigma(z_i)}{\partial z_i}\frac{\partial z_i}{\partial \omega}=x_ix_i^T\sigma(z_i)(1-\sigma(z_i))$$

For $m$ samples we have $\vec{\nabla}^2l(\omega)=\sum_{i=1}^m x_ix_i^T\sigma(z_i)(1-\sigma(z_i))$. This is equivalent to concatenating column vectors $x_i\in\mathbb{R}^d$ into a matrix $X$ of size $d\times m$ such that $\sum_{i=1}^m x_ix_i^T=XX^T$. The scalar terms are combined in a diagonal matrix $D$ such that $D_{ii}=\sigma(z_i)(1-\sigma(z_i))$. Finally, we conclude that

$$ \vec{H}(\omega)=\vec{\nabla}^2l(\omega)=XDX^T$$
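
The equivalence between the sum of rank-one terms and the $XDX^T$ form is easy to confirm numerically as well; a minimal sketch, again with made-up random data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 50
X = rng.normal(size=(d, m))              # one sample x_i per column
w = rng.normal(size=d)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
s = sigmoid(X.T @ w)                     # sigma(z_i) for every sample

D = np.diag(s * (1 - s))                 # D_ii = sigma(z_i)(1 - sigma(z_i))
H_matrix = X @ D @ X.T                   # X D X^T

# Same Hessian as an explicit sum of rank-one terms x_i x_i^T sigma(z_i)(1 - sigma(z_i))
H_sum = sum(si * (1 - si) * np.outer(xi, xi) for xi, si in zip(X.T, s))

assert np.allclose(H_matrix, H_sum)
```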

A faster approach is to consider all samples at once from the beginning and work with matrix derivatives instead. As an extra note, with this formulation it's trivial to show that $l(\omega)$ is convex. Let $\delta\in\mathbb{R}^d$ be any vector. Then

$$\delta^T\vec{H}(\omega)\delta = \delta^T\vec{\nabla}^2l(\omega)\delta = \delta^TXDX^T\delta = \big(D^{1/2}X^T\delta\big)^T\big(D^{1/2}X^T\delta\big) = \big\|D^{1/2}X^T\delta\big\|^2\geq 0,$$

where $D^{1/2}$ is well defined, with $D^{1/2}_{ii}=\sqrt{\sigma(z_i)(1-\sigma(z_i))}$, since $D_{ii}>0$ for every $i$. This implies $H$ is positive semidefinite and therefore $l$ is convex (but not strongly convex).
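
For a numerical spot check of this conclusion, one can confirm that the eigenvalues of $\vec{H}(\omega)$ are nonnegative and that the quadratic form above is nonnegative for a random $\delta$; a minimal sketch (the random data and the roundoff tolerance are my choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 20
X = rng.normal(size=(d, m))
w = rng.normal(size=d)

s = 1.0 / (1.0 + np.exp(-(X.T @ w)))
H = X @ np.diag(s * (1 - s)) @ X.T

# All eigenvalues of the symmetric Hessian should be >= 0 (up to roundoff)
assert np.all(np.linalg.eigvalsh(H) >= -1e-10)

# Equivalently: delta^T H delta >= 0 for a random delta
delta = rng.normal(size=d)
assert delta @ H @ delta >= 0
```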