Exponential matrix derivative to find the Hessian matrix of negative binomial regression

derivatives, hessian-matrix, matrix-exponential, matrix-calculus, negative-binomial

I am looking for the Hessian matrix of the log-likelihood function of negative binomial regression:

$$l\left( \cdot \right) =\sum ^{n}_{i=1}\left[ y_{i}\ln \left( \dfrac{\alpha \exp \left( x_{i}^{T}\beta \right) }{1+\alpha \exp \left( x_{i}^{T}\beta \right) }\right) -\dfrac{1}{\alpha }\ln \left( 1+\alpha \exp \left( x_{i}^{T}\beta \right) \right) +\ln \Gamma \left( y_{i}+\dfrac{1}{\alpha }\right) -\ln \Gamma \left( y_{i}+1\right) -\ln \Gamma \left( \dfrac{1}{\alpha }\right) \right]$$
In Hilbe (2011), the derivative is given as
$$\dfrac{\partial l\left( \cdot \right) }{\partial \beta _{k}}= \sum ^{n}_{i=1}\left( \dfrac{\left( y_{i}-\exp \left( x_{i}^{T}\beta \right) \right) x_{i}}{\left( 1+\alpha \exp \left( x_{i}^{T}\beta \right) \right) }\right)$$

Here I'm confused: why does the derivative take the form $x_{i}$ rather than $x_{i}^{T}$? The negative Hessian is then given as
$$-\dfrac{\partial ^{2}l\left( \cdot \right) }{\partial \beta \partial \beta ^{T}}=\sum ^{n}_{i=1}\left( y_{i}\alpha +1\right) x_{i}x_{i}^{T}\left( \dfrac{\exp \left( x_{i}^{T}\beta \right) }{\left( 1+\alpha \exp \left( x_{i}^{T}\beta \right) \right) ^{2}}\right)$$
At this point, I don't understand how the differentiation with respect to $\beta ^{T}$ works.
I would be very grateful if you could describe the calculation process. I would also appreciate recommendations of books or papers on matrix derivatives of exponential expressions and on differentiating with respect to $\beta _{k}$ or $\beta _{k}^{T}$.

Best Answer

$ \def\BR#1{\Big(#1\Big)} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\diag#1{\op{diag}\LR{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\a{\alpha}\def\b{\beta} \def\l{\lambda}\def\s{\sigma} \def\o{{\tt1}}\def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\hess#1#2{\frac{\p^2 #1}{\p #2^2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\A{A^{-1}} \def\S{S^{-1}} $The Frobenius product $(:)$ is extremely useful in Matrix Calculus $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_F \\ }$$ This is also called the double-dot or double contraction product.
When applied to vectors $(n=\o)$ it reduces to the standard dot product.

The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many fruitful ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A &= \LR{A^TC}:B \\ }$$ It also commutes with the elementwise/Hadamard product $(\odot)$ $$A:\LR{B\odot C} = \LR{A\odot B}:C\\$$
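As a quick sanity check (not part of the original answer), here is a minimal NumPy sketch that verifies the definition and the rearrangement rules above on hypothetical random matrices; all matrix names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frobenius product  A:B = sum_ij A_ij * B_ij
frob = lambda P, Q: np.sum(P * Q)

A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))
assert np.isclose(frob(A, B), np.trace(A.T @ B))             # A:B = Tr(A^T B)
assert np.isclose(frob(A, A), np.linalg.norm(A, 'fro') ** 2)  # A:A = ||A||_F^2

# Rearrangement rule  C:(AB) = (C B^T):A = (A^T C):B  (with conformable shapes)
A2 = rng.standard_normal((3, 4))
B2 = rng.standard_normal((4, 5))
C2 = rng.standard_normal((3, 5))
assert np.isclose(frob(C2, A2 @ B2), frob(C2 @ B2.T, A2))
assert np.isclose(frob(C2, A2 @ B2), frob(A2.T @ C2, B2))

# Commutation with the elementwise (Hadamard) product:  A:(B o C) = (A o B):C
C3 = rng.standard_normal((3, 4))
assert np.isclose(frob(A, B * C3), frob(A * B, C3))
```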


For typing convenience, define the variables (all functions are applied element-wise)
$$\eqalign{ X &= \big[x_1\;\;x_2\;\cdots\;x_n\big] \\ a &= \a\o,\;b=\b,\;w=\frac{\o}{a} \\ z &= X^Tb + \log(a) &\qiq dz = X^Tdb \\ e &= \exp(z) = \a\exp(X^Tb) &\qiq de = e\odot dz \\ s &= \s(z) = \frac{e}{\o+e} &\qiq \big({\rm Logistic\;function}\big) \\ S &= \Diag s &\qiq ds = \LR{S-S^2} dz \\ Y &= \Diag y,\;W\!=\!\Diag w \\ }$$
The derivative of the Logistic function shown above is well known.
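The following is a small NumPy sketch (an illustration, not from the answer) that builds these variables from hypothetical random data, mirroring the names above, and checks the stated logistic derivative $ds=\LR{S-S^2}dz$ by finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 8                                  # hypothetical problem size
X = rng.standard_normal((p, n))              # columns are x_1, ..., x_n
y = rng.poisson(2.0, size=n).astype(float)   # hypothetical counts
alpha = 0.7                                  # hypothetical dispersion parameter
beta = rng.standard_normal(p)

w = np.full(n, 1.0 / alpha)          # w = 1/alpha
z = X.T @ beta + np.log(alpha)       # z = X^T b + log(a)
e = np.exp(z)                        # e = alpha * exp(X^T b)
s = e / (1.0 + e)                    # logistic function of z
S, Y, W = np.diag(s), np.diag(y), np.diag(w)

# Check the logistic derivative  ds = (S - S^2) dz  by central finite differences
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
dz = rng.standard_normal(n)          # arbitrary direction
h = 1e-6
ds_fd = (sig(z + h * dz) - sig(z - h * dz)) / (2 * h)
assert np.allclose(ds_fd, (S - S @ S) @ dz, atol=1e-6)
```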

Only the first two terms of the log-likelihood expression contain $\b$, therefore create a truncated function and calculate its differential and gradient.
$$\eqalign{ \l &= y:\log(s) - w:\log(\o+e) \\ d\l &= y:\fracLR{ds}{s} - w:\fracLR{de}{\o+e} \\ &= y:\LR{\S ds} - w:\fracLR{e\odot dz}{\o+e} \\ &= y:\LR{I-S}\c{dz} - w:S\:\c{dz} \\ &=\BR{Iy-Sy-Sw}:\c{X^Tdb} \\ &= X\BR{y-Sy-Sw}:db \\ &= X\BR{y-Ys-Ws}:db \\ g\;=\; \grad{\l}{b} &= X\BR{y-\LR{Y+W}s} \\ }$$
Here the fourth line uses $ds=\LR{S-S^2}dz$ and $\frac{e}{\o+e}=s$, while the fifth rearranges the Frobenius products and substitutes $dz=X^Tdb$.

Next, calculate the gradient of $g$, i.e. the Hessian.
$$\eqalign{ dg &= -X(Y+W)\:ds \\ &= -X(Y+W)\LR{S-S^2}dz \\ &= -X(Y+W)\LR{S-S^2}X^Tdb \\ H= \hess{\l}{b} \;=\;\grad{g}{b} &= -\c{X}(Y+W)\LR{S-S^2}\c{X^T} \\ }$$
Notice that the terms sandwiched between $X$ and $X^T$ are diagonal matrices, therefore $H$ is symmetric (as it should be).
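To make the result concrete, here is a hedged NumPy sketch (an illustration rather than the author's code) that evaluates $g$ and $H$ on hypothetical random data, checks them against finite differences of the truncated log-likelihood, and confirms agreement with the summation form of the negative Hessian quoted from Hilbe (2011).

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 3, 8                                  # hypothetical problem size
X = rng.standard_normal((p, n))              # columns are x_1, ..., x_n
y = rng.poisson(2.0, size=n).astype(float)   # hypothetical counts
alpha = 0.7                                  # hypothetical dispersion parameter
beta = rng.standard_normal(p)
w = np.full(n, 1.0 / alpha)

def s_and_e(b):
    z = X.T @ b + np.log(alpha)
    e = np.exp(z)
    return e, e / (1.0 + e)

def loglik(b):                               # truncated log-likelihood  lambda(b)
    e, s = s_and_e(b)
    return y @ np.log(s) - w @ np.log(1.0 + e)

e, s = s_and_e(beta)
S, Y, W = np.diag(s), np.diag(y), np.diag(w)

g = X @ (y - (Y + W) @ s)                    # gradient from the derivation
H = -X @ (Y + W) @ (S - S @ S) @ X.T         # Hessian from the derivation

# Central finite-difference check of the gradient
h = 1e-5
g_fd = np.array([(loglik(beta + h * u) - loglik(beta - h * u)) / (2 * h)
                 for u in np.eye(p)])
assert np.allclose(g, g_fd, atol=1e-5)

# Central finite-difference check of the Hessian
h = 1e-4
H_fd = np.array([[(loglik(beta + h * (u + v)) - loglik(beta + h * (u - v))
                   - loglik(beta - h * (u - v)) + loglik(beta - h * (u + v))) / (4 * h * h)
                  for v in np.eye(p)] for u in np.eye(p)])
assert np.allclose(H, H_fd, atol=1e-4)

# Agreement with the quoted summation form of the negative Hessian
mu = np.exp(X.T @ beta)
negH = sum((y[i] * alpha + 1.0) * np.outer(X[:, i], X[:, i]) * mu[i] / (1.0 + alpha * mu[i]) ** 2
           for i in range(n))
assert np.allclose(-H, negH)
```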

Some people are more pedantic and write $(\p b^2)$ as $(\p b\:\p b^T)$, which properly conveys the shape of the Hessian matrix; this is the $\p\b\,\p\b^T$ notation that appears in the question.


As for book recommendations, the standard text is probably Magnus and Neudecker's Matrix Differential Calculus, although personally I prefer Hjørungnes's Complex-Valued Matrix Derivatives.

For simply looking up formulas: if you cannot find what you need in Petersen and Pedersen's Matrix Cookbook (freely available online), consult Bernstein's Matrix Mathematics: Theory, Facts, and Formulas.
