Gradient and Hessian of $\sum_i \log \left(1 + \exp\left\{ -t_i \left(w^T x_i\right)\right\} \right) + \mu \|w \|_2^2$

Tags: derivatives, matrix-calculus, multivariable-calculus

Are my gradient and Hessian of the following function correct?
\begin{align}
f &= \sum_i \log \left(1 + \exp\left\{ -t_i \left(w^T x_i\right)\right\} \right) + \mu \|w \|_2^2 \ ,
\end{align}

where $t_i \in \mathbb{R}$, $w, x_i \in \mathbb{R}^n$, and $\mu \in \mathbb{R}$.

I want to find the gradient and Hessian w.r.t. $w$, that is $\frac{\partial f}{\partial w}$ and $\frac{\partial^2 f}{\partial w^2}$.


Partial attempt

Gradient

\begin{align}
\frac{\partial f}{\partial w}
&= \sum_i \left( \frac{-t_i x_i \exp\left\{ -t_i \left(w^T x_i\right)\right\}}{1 + \exp\left\{ -t_i \left(w^T x_i\right)\right\} } \right) + 2\mu w \ \ \\
&= \sum_i \left( \frac{-t_i x_i }{1 + \exp\left\{ +t_i \left(w^T x_i\right)\right\} } \right) + 2\mu w \ .
\end{align}
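As a sanity check, the gradient formula can be compared against central finite differences of $f$. This is a sketch assuming NumPy; the values of $t$, $X$, $w$, and $\mu$ below are made-up test data, not from the problem:

```python
import numpy as np

# Made-up test data: columns of X are the x_i vectors.
rng = np.random.default_rng(0)
n, m = 4, 6                        # dim of w, number of samples
X = rng.standard_normal((n, m))
t = rng.standard_normal(m)
w = rng.standard_normal(n)
mu = 0.7

def f(w):
    # sum_i log(1 + exp(-t_i w^T x_i)) + mu ||w||^2
    return np.sum(np.log1p(np.exp(-t * (X.T @ w)))) + mu * (w @ w)

def grad(w):
    # sum_i -t_i x_i / (1 + exp(+t_i w^T x_i)) + 2 mu w
    return -X @ (t / (1.0 + np.exp(t * (X.T @ w)))) + 2 * mu * w

# Central finite differences, one coordinate at a time.
eps = 1e-6
g_fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
assert np.allclose(grad(w), g_fd, atol=1e-6)
```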

Is this gradient correct?

Hessian

\begin{align}
\frac{\partial^2 f}{\partial w^2}
&= \frac{\partial}{ \partial w} \left[ \sum_i \left( \frac{-t_i x_i }{1 + \exp\left\{ +t_i \left(w^T x_i\right)\right\} } \right) + 2\mu w \right] \\
&= \sum_i \left( \frac{t_i^2 x_i x_i \ \exp\left\{ +t_i \left(w^T x_i\right)\right\}}{\left(1 + \exp\left\{ +t_i \left(w^T x_i\right)\right\}\right)^2 } \right) + 2\mu I \ .
\end{align}

I think my Hessian must be incorrect: the numerator of the first term contains $x_i x_i$, and two column vectors can't simply be multiplied like that 🙁 …

Best Answer

The gradient looks correct, but the Hessian doesn't. Here's how I did the calculations.

In order to write the function in purely matrix form, first note that the $\{x_i\}$ vectors are the columns of a single matrix $X$. Next, use $(\circ)$ to denote the elementwise/Hadamard product and $(:)$ to denote the trace/Frobenius product, i.e. $$A:B = {\rm Tr}(A^TB).$$

Define the following variables.
$$\eqalign{
a &= t\circ X^Tw &\implies da = t\circ X^Tdw \cr
b &= \exp(-a) &\implies db = -b\circ da \cr
p &= \exp(a) &\implies dp = p\circ da \implies 1 = b\circ p \cr
c &= \log(1+b) &\implies dc = \frac{db}{1+b} \cr
}$$
Write the function in terms of these variables, then calculate its differential and back-substitute until we arrive at the gradient with respect to $w$.
$$\eqalign{
f &= \mu\,w:w + 1:c \cr
df &= 2\mu\,w:dw + 1:dc \cr
&= 2\mu\,w:dw + \frac{1}{1+b}:db \cr
&= 2\mu\,w:dw - \frac{1}{1+b}:b\circ da \cr
&= 2\mu\,w:dw - \frac{b}{1+b}:t\circ X^Tdw \cr
&= 2\mu\,w:dw - X\Big(\frac{t\circ b}{1+b}\Big):dw \cr
&= \bigg(2\mu\,w - X\Big(\frac{t}{1+p}\Big)\bigg):dw \cr
g = \frac{\partial f}{\partial w} &= 2\mu\,w - X\Big(\frac{t}{1+p}\Big) \cr
}$$
(In the last two steps, $\frac{b}{1+b} = \frac{1}{1+p}$ because $b\circ p = 1$.) Now find the differential and gradient of $g$.
$$\eqalign{
dg &= 2\mu\,dw + X\Big(\frac{t\circ dp}{(1+p)\circ(1+p)}\Big) \cr
&= 2\mu\,dw + X\Big(\frac{t\circ p\circ da}{1+2p+p\circ p}\Big) \cr
&= 2\mu\,dw + X\Big(\frac{t\circ p\circ t\circ X^Tdw}{1+2p+p\circ p}\Big) \cr
}$$
Replace the Hadamard products with diagonal matrices, e.g.
$$P = {\rm Diag}(p), \quad T = {\rm Diag}(t), \quad I = {\rm Diag}(1), \qquad Px = p\circ x.$$
Therefore
$$\eqalign{
dg &= \Big(2\mu I + X(I+2P+P^2)^{-1}T^2PX^T\Big)\,dw \cr
H = \frac{\partial g}{\partial w} &= 2\mu I + X(I+2P+P^2)^{-1}T^2PX^T \cr
}$$
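Both matrix-form results can be verified numerically by differentiating $f$ and $g$ with central finite differences. A minimal sketch assuming NumPy, with made-up test values for $t$, $X$, $w$, and $\mu$:

```python
import numpy as np

# Made-up test data: columns of X are the x_i vectors.
rng = np.random.default_rng(1)
n, m = 3, 5
X = rng.standard_normal((n, m))
t = rng.standard_normal(m)
w = rng.standard_normal(n)
mu = 0.5

def f(w):
    # f = mu w:w + 1:c  with  c = log(1 + exp(-a)),  a = t o X^T w
    return mu * (w @ w) + np.sum(np.log1p(np.exp(-t * (X.T @ w))))

def g(w):
    # g = 2 mu w - X (t / (1 + p))  with  p = exp(a)
    p = np.exp(t * (X.T @ w))
    return 2 * mu * w - X @ (t / (1 + p))

def H(w):
    # H = 2 mu I + X (I + 2P + P^2)^{-1} T^2 P X^T
    p = np.exp(t * (X.T @ w))
    P, T = np.diag(p), np.diag(t)
    M = np.linalg.inv(np.eye(m) + 2 * P + P @ P)
    return 2 * mu * np.eye(n) + X @ M @ T @ T @ P @ X.T

eps = 1e-6
g_fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
H_fd = np.array([(g(w + eps * e) - g(w - eps * e)) / (2 * eps)
                 for e in np.eye(n)]).T
assert np.allclose(g(w), g_fd, atol=1e-5)
assert np.allclose(H(w), H_fd, atol=1e-4)
```

Note that $(I+2P+P^2)^{-1}T^2P$ is just the diagonal matrix ${\rm Diag}\big(t^2\circ p/(1+p)^2\big)$, so no actual matrix inversion is needed in practice.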