Hessian Matrix vs Covariance Matrix – Understanding the Relationship

data mining, machine learning, mathematical-statistics, maximum likelihood

While studying Maximum Likelihood Estimation, I learned that to do inference with the maximum likelihood estimator we need to know its variance. To find the variance, I need the Cramér-Rao Lower Bound, which looks like a Hessian matrix of second derivatives describing the curvature of the log-likelihood. I am mixed up about how to define the relationship between the covariance matrix and the Hessian matrix. I hope to hear some explanations about this question. A simple example would be appreciated.

Best Answer

You should first check out this related question: Basic question about Fisher Information matrix and relationship to Hessian and standard errors.

Suppose we have a statistical model (a family of distributions) $\{f_{\theta}: \theta \in \Theta\}$. In the most general case we have $\dim(\Theta) = d$, so this family is parameterized by $\theta = (\theta_1, \dots, \theta_d)^T$. Under certain regularity conditions, we have

$$I_{i,j}(\theta) = -E_{\theta}\Big[\frac{\partial^2 l(X; \theta)}{\partial\theta_i\partial\theta_j}\Big] = -E_\theta\Big[H_{i,j}(l(X;\theta))\Big]$$

where $I_{i,j}(\theta)$ is the $(i,j)$ entry of the Fisher Information matrix (viewed as a function of $\theta$), $X$ is the observed value (the sample), and $l$ is the log-likelihood

$$l(X; \theta) = \ln f_{\theta}(X),\text{ for } \theta \in \Theta$$

So the Fisher Information matrix is the negated expected value of the Hessian of the log-likelihood under a given $\theta$.
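To make this concrete, here is a minimal numerical sketch in Python, assuming a single Bernoulli($p$) observation (an illustrative choice of model and parameter value): the Monte Carlo average of the negated second derivative of the log-likelihood should match the analytic Fisher information $1/(p(1-p))$.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3          # assumed true parameter (illustrative choice)
n_samples = 200_000   # Monte Carlo sample size

def d2_loglik(x, p):
    # d^2/dp^2 of the Bernoulli log-likelihood x*log(p) + (1-x)*log(1-p)
    return -x / p**2 - (1 - x) / (1 - p)**2

x = rng.binomial(1, p_true, size=n_samples)

# Fisher information = negated expected Hessian (here a 1x1 "matrix")
fisher_mc = -d2_loglik(x, p_true).mean()
fisher_analytic = 1 / (p_true * (1 - p_true))

print(fisher_mc, fisher_analytic)   # the two values should be close
```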

Now let's say we want to estimate some vector function of the unknown parameter, $\psi(\theta)$. Usually it is desirable that the estimator $T(X) = (T_1(X), \dots, T_d(X))$ be unbiased, i.e.

$$\forall_{\theta \in \Theta}\ E_{\theta}[T(X)] = \psi(\theta)$$

The Cramér-Rao Lower Bound states that for every unbiased estimator $T(X)$ the covariance matrix $cov_{\theta}(T(X))$ satisfies

$$cov_{\theta}(T(X)) \ge \frac{\partial\psi(\theta)}{\partial\theta}I^{-1}(\theta)\Big(\frac{\partial\psi(\theta)}{\partial\theta}\Big)^T = B(\theta)$$

where $A \ge B$ for matrices means that $A - B$ is positive semi-definite, and $\frac{\partial\psi(\theta)}{\partial\theta}$ is simply the Jacobian $J_{i,j}(\psi)$. Note that if we estimate $\theta$ itself, that is $\psi(\theta) = \theta$, the above simplifies to

$$cov_{\theta}(T(X)) \ge I^{-1}(\theta)$$
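As a standard worked example (assuming i.i.d. $X_1, \dots, X_n \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, an illustrative choice), the Fisher information for $\mu$ is

$$I(\mu) = \frac{n}{\sigma^2} \quad\Rightarrow\quad var_{\mu}(T(X)) \ge I^{-1}(\mu) = \frac{\sigma^2}{n},$$

which the sample mean $\bar X$ attains, so $\bar X$ is efficient. For a non-trivial $\psi$, say $\psi(\mu) = \mu^2$ with Jacobian $\frac{\partial\psi}{\partial\mu} = 2\mu$, the general bound gives

$$var_{\mu}(T(X)) \ge (2\mu)\,\frac{\sigma^2}{n}\,(2\mu) = \frac{4\mu^2\sigma^2}{n}.$$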

But what does it tell us really? For example, recall that

$$var_{\theta}(T_i(X)) = [cov_{\theta}(T(X))]_{i,i}$$

and that for every positive semi-definite matrix $A$ the diagonal elements are non-negative

$$\forall_i\ A_{i,i} \ge 0$$

From the above we can conclude that the variance of each component of the estimator is bounded below by the corresponding diagonal element of the matrix $B(\theta)$

$$\forall_i\ var_{\theta}(T_i(X)) \ge [B(\theta)]_{i,i}$$

So the CRLB doesn't tell us the variance of our estimator, but whether or not our estimator is optimal, i.e. whether it has the lowest covariance among all unbiased estimators.
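To illustrate that last point, here is a small simulation sketch in Python (the model, parameter values, and the choice of sample mean vs. sample median as competing unbiased estimators are assumptions made for illustration): for $N(\mu, \sigma^2)$ data with $\sigma$ known, the sample mean should attain the CRLB $\sigma^2/n$, while the sample median, also unbiased here by symmetry, should come in roughly $\pi/2$ times above it.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 1.0, 2.0, 50   # illustrative true parameters and sample size
n_reps = 20_000               # number of simulated datasets

# CRLB for estimating mu with sigma known: I(mu) = n / sigma^2, so the bound is sigma^2 / n
crlb = sigma**2 / n

samples = rng.normal(mu, sigma, size=(n_reps, n))
var_mean = samples.mean(axis=1).var()          # sample mean: should attain the bound
var_median = np.median(samples, axis=1).var()  # sample median: unbiased by symmetry, less efficient

print(f"CRLB        : {crlb:.4f}")
print(f"var(mean)   : {var_mean:.4f}")    # expected to be close to the CRLB
print(f"var(median) : {var_median:.4f}")  # expected to be about (pi/2) * CRLB
```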