Solved – Hessian matrix for maximum likelihood

generalized linear model, mathematical-statistics, maximum likelihood

Here's a question from my problem sheet.

For the normal linear model, verify that the MLEs $\boldsymbol{\hat{\beta}}$ and $\tilde{\sigma}^2$ maximize $\ell(\boldsymbol{\beta}, \sigma^2;\mathbf{y})$ with respect to $\boldsymbol{\beta}$ and $\sigma^2$, where $\ell$ denotes the log-likelihood. What is the maximum value of the likelihood $L(\boldsymbol{\beta}, \sigma^2;\mathbf{y})$? That is, compute $\max_{\boldsymbol{\beta},\sigma^2}L(\boldsymbol{\beta},\sigma^2;\mathbf{y})$, where $\mathbf{y}$ is the vector of observations.

I have tried to solve this question, but I am confused by the solution, mostly by the Hessian.

According to the answer, the Hessian $H(\boldsymbol{\beta},\sigma^2)$ is given by
$$
\begin{pmatrix}
\dfrac{\partial}{\partial \boldsymbol{\beta}^T} \left[ \dfrac{\partial \ell(\boldsymbol{\beta},\sigma^2;\mathbf{y})}{\partial \boldsymbol{\beta}} \right] & \dfrac{\partial}{\partial \sigma^2} \left[ \dfrac{\partial \ell(\boldsymbol{\beta},\sigma^2;\mathbf{y})}{\partial \boldsymbol{\beta}}\right] \\
\dfrac{\partial}{\partial \boldsymbol{\beta}^T} \left[ \dfrac{\partial \ell(\boldsymbol{\beta},\sigma^2;\mathbf{y})}{\partial \sigma^2} \right] & \dfrac{\partial}{\partial \sigma^2} \left[ \dfrac{\partial \ell(\boldsymbol{\beta},\sigma^2;\mathbf{y})}{\partial \sigma^2} \right] \\
\end{pmatrix}
$$

I have two questions:

  1. How do I know when I need to use the transpose? For example, why isn't the (1,1) element of the Hessian matrix just $\dfrac{\partial}{\partial \boldsymbol{\beta}}\left[\dfrac{\partial \ell(\boldsymbol{\beta},\sigma^2;\mathbf{y})}{\partial \boldsymbol{\beta}}\right]$?

  2. Why does the (2,1) element of the Hessian need the outer partial derivative to be taken with respect to $\boldsymbol{\beta}^{T}$, not just $\boldsymbol{\beta}$?

Best Answer

Remember that $\pmb{\beta} \in \Re^{n \times 1}$ is a column vector, so the partial derivatives you describe are themselves vectors and matrices. In particular, the Hessian is

$$H(\pmb{\beta}, \sigma^2) = \begin{pmatrix} \frac{\partial}{\partial \boldsymbol{\beta}^{T}}\Big[\frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \boldsymbol{\beta}}\Big] & \frac{\partial}{\partial \sigma^{2}}\Big[\frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \boldsymbol{\beta}}\Big]\\ \frac{\partial}{\partial \boldsymbol{\beta}^{T}}\Big[\frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \sigma^{2}}\Big] & \frac{\partial}{\partial \sigma^{2}}\Big[\frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \sigma^{2}}\Big] \\ \end{pmatrix} \in \Re^{(n+1) \times (n+1)}$$

and since $\pmb{\beta}$ is a column vector, we have

$$ \frac{\partial}{\partial \boldsymbol{\beta}^{T}}\Big[\frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \boldsymbol{\beta}}\Big] \in \Re^{n \times n} \text{, is a matrix}$$

$$\frac{\partial}{\partial \sigma^{2}}\Big[\frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \boldsymbol{\beta}}\Big] \in \Re^{n \times 1} \text{, is a column vector}$$

$$\frac{\partial}{\partial \boldsymbol{\beta}^{T}}\Big[\frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \sigma^{2}}\Big] \in \Re^{1 \times n} \text{, is a row vector}$$
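These shapes can be checked numerically. The snippet below is only a sketch under assumptions that are not in the problem sheet: it takes the usual normal linear model $\mathbf{y} \sim N(X\boldsymbol{\beta}, \sigma^2 I)$ with a hypothetical $N \times n$ design matrix `X`, simulated data, and the standard closed-form second derivatives of that model's log-likelihood (written out in the comments). The helper name `hessian_blocks` is made up for illustration.

```python
import numpy as np

def hessian_blocks(beta, sigma2, X, y):
    """Second-derivative blocks of the normal linear model log-likelihood.

    Assumed model (not from the original post): y ~ N(X beta, sigma2 * I).
    """
    N = y.shape[0]
    resid = (y - X @ beta).reshape(-1, 1)        # N x 1 residual column
    rss = np.sum(resid ** 2)                     # residual sum of squares
    H_bb = -(X.T @ X) / sigma2                   # d/d(beta^T) [dl/d(beta)]   -> n x n
    H_bs = -(X.T @ resid) / sigma2 ** 2          # d/d(sigma2) [dl/d(beta)]   -> n x 1
    H_sb = H_bs.T                                # d/d(beta^T) [dl/d(sigma2)] -> 1 x n
    H_ss = np.array([[N / (2 * sigma2 ** 2) - rss / sigma2 ** 3]])  # 1 x 1
    return H_bb, H_bs, H_sb, H_ss

# Simulated data, purely for illustration.
rng = np.random.default_rng(0)
N, n = 50, 3
X = rng.normal(size=(N, n))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

# MLEs: least-squares coefficients and the residual variance with divisor N.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_tilde = np.sum((y - X @ beta_hat) ** 2) / N

H_bb, H_bs, H_sb, H_ss = hessian_blocks(beta_hat, sigma2_tilde, X, y)

# The blocks only tile into one matrix because the top-right block is a
# column (n x 1) and the bottom-left block is a row (1 x n).
H = np.block([[H_bb, H_bs], [H_sb, H_ss]])

print(H_bb.shape, H_bs.shape, H_sb.shape, H_ss.shape)  # (3, 3) (3, 1) (1, 3) (1, 1)
print(H.shape)                                         # (4, 4)
print(np.all(np.linalg.eigvalsh(H) < 0))               # negative definite at the MLE?
```

If the bottom-left block were computed as a column instead of a row, `np.block` would raise an error because the pieces would no longer line up, which is the same "fit" requirement that the $\boldsymbol{\beta}^T$ notation encodes.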

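So we simply take partial derivatives w.r.t. $\pmb{\beta}$ or $\pmb{\beta}^T$ so that the blocks "fit" together in one matrix: differentiating with respect to $\pmb{\beta}$ stacks the results as a column, while differentiating with respect to $\pmb{\beta}^T$ lays them out as a row. To visualize it better: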

$$\frac{\partial}{\partial \boldsymbol{\beta}^{T}}\Big[\frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \boldsymbol{\beta}}\Big] = \frac{\partial}{\partial \boldsymbol{\beta}^{T}} \begin{pmatrix} \frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_1} \\ \vdots \\ \frac{\partial l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_n} \end{pmatrix} = \begin{pmatrix} \frac{\partial^2 l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_1 \partial \beta_1} & \frac{\partial^2 l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_1 \partial \beta_2} & \dots & \frac{\partial^2 l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_1 \partial \beta_n} \\ \frac{\partial^2 l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_2 \partial \beta_1} & \frac{\partial^2 l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_2 \partial \beta_2} & \dots & \frac{\partial^2 l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_2 \partial \beta_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_n \partial \beta_1} & \frac{\partial^2 l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_n \partial \beta_2} & \dots & \frac{\partial^2 l(\boldsymbol{\beta},\sigma^{2};\mathbf{y})}{\partial \beta_n \partial \beta_n} \end{pmatrix}$$
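
To tie this back to the original exercise, here is a sketch of what the blocks look like for the normal linear model. It assumes the standard setup $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 I_N)$. The $N \times n$ design matrix $X$ (taken to have full column rank) and the number of observations $N$ are notation introduced here, not symbols from the problem sheet. The log-likelihood and its first derivatives are

$$l(\boldsymbol{\beta},\sigma^{2};\mathbf{y}) = -\frac{N}{2}\log(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}(\mathbf{y}-X\boldsymbol{\beta})^{T}(\mathbf{y}-X\boldsymbol{\beta})$$

$$\frac{\partial l}{\partial \boldsymbol{\beta}} = \frac{1}{\sigma^{2}}X^{T}(\mathbf{y}-X\boldsymbol{\beta}), \qquad \frac{\partial l}{\partial \sigma^{2}} = -\frac{N}{2\sigma^{2}} + \frac{1}{2\sigma^{4}}(\mathbf{y}-X\boldsymbol{\beta})^{T}(\mathbf{y}-X\boldsymbol{\beta})$$

Differentiating once more gives the four blocks:

$$H(\boldsymbol{\beta},\sigma^{2}) = \begin{pmatrix} -\dfrac{1}{\sigma^{2}}X^{T}X & -\dfrac{1}{\sigma^{4}}X^{T}(\mathbf{y}-X\boldsymbol{\beta}) \\ -\dfrac{1}{\sigma^{4}}(\mathbf{y}-X\boldsymbol{\beta})^{T}X & \dfrac{N}{2\sigma^{4}} - \dfrac{1}{\sigma^{6}}(\mathbf{y}-X\boldsymbol{\beta})^{T}(\mathbf{y}-X\boldsymbol{\beta}) \end{pmatrix}$$

At the MLEs $\boldsymbol{\hat{\beta}} = (X^{T}X)^{-1}X^{T}\mathbf{y}$ and $\tilde{\sigma}^{2} = \frac{1}{N}(\mathbf{y}-X\boldsymbol{\hat{\beta}})^{T}(\mathbf{y}-X\boldsymbol{\hat{\beta}})$, the off-diagonal blocks vanish because $X^{T}(\mathbf{y}-X\boldsymbol{\hat{\beta}}) = \mathbf{0}$, and the bottom-right entry simplifies:

$$H(\boldsymbol{\hat{\beta}},\tilde{\sigma}^{2}) = \begin{pmatrix} -\dfrac{1}{\tilde{\sigma}^{2}}X^{T}X & \mathbf{0} \\ \mathbf{0}^{T} & -\dfrac{N}{2\tilde{\sigma}^{4}} \end{pmatrix}$$

Under the full-rank assumption $X^{T}X$ is positive definite, so this matrix is negative definite and the stationary point is a maximum. Substituting the MLEs into the likelihood then gives

$$\max_{\boldsymbol{\beta},\sigma^{2}} L(\boldsymbol{\beta},\sigma^{2};\mathbf{y}) = (2\pi\tilde{\sigma}^{2})^{-N/2}\, e^{-N/2}$$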