While studying Maximum Likelihood Estimation, I learned that to do inference with the MLE we need to know its variance. To find the variance, I need the Cramér-Rao lower bound, which looks like a Hessian matrix of second derivatives describing the curvature. I am mixed up about how to define the relationship between the covariance matrix and the Hessian matrix. I hope to hear some explanations about this question. A simple example would be appreciated.
Hessian Matrix vs Covariance Matrix – Understanding the Relationship
data-mining, machine-learning, mathematical-statistics, maximum-likelihood
Related Solutions
You are using a unit step length for your optimization; that is why your method diverges after 40-50 iterations. Here is the plot of your intermediate solutions:
You should perform line search (e.g. backtracking) along your Newton directions.
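For concreteness, here is a minimal sketch of a damped Newton iteration with a backtracking (Armijo) line search, written in Python; the Rosenbrock test function and the parameter values are just placeholders for your own objective.

```python
import numpy as np
from scipy.optimize import rosen, rosen_der, rosen_hess  # toy objective

def newton_with_backtracking(f, grad, hess, x0, tol=1e-8, max_iter=100,
                             alpha=1e-4, beta=0.5):
    """Newton's method with a backtracking (Armijo) line search
    instead of a fixed unit step length."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)   # Newton direction
        if g @ d >= 0:                     # Hessian not positive definite here:
            d = -g                         # fall back to steepest descent
        t = 1.0                            # start from the full Newton step
        # shrink the step until the sufficient-decrease condition holds
        while f(x + t * d) > f(x) + alpha * t * (g @ d) and t > 1e-12:
            t *= beta
        x = x + t * d
    return x

x_star = newton_with_backtracking(rosen, rosen_der, rosen_hess, [-1.2, 1.0])
print(x_star)   # should be close to the minimizer (1, 1)
```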
I have always found this result to be counter-intuitive as well. In my case, this is due to a tendency to confuse the score function and the maximum likelihood estimator. (Oh, the shame!) In fact, they sort of pull in opposite directions.
Consider the one-parameter case. In large samples, the log likelihood tends towards a downward-opening quadratic. The second derivative will be large and negative when the quadratic is tight around the maximum, so its reciprocal in absolute value will be small. Intuitively, it makes sense that the variance of the (scaled) MLE should be the reciprocal of the Fisher information (the LHS of the identity). And it is.
But the RHS of the identity is not about the MLE. On the contrary: the score function depends on both the sample values and the true parameter. If we knew the true value of the parameter and drew many samples, $s(x,\theta)$ would vary about 0. The RHS is the variance of that quantity.
In the absurd case where the sample is of size 1 and the distribution is a normal with known variance, the RHS is the variance of $$\frac{x-\mu}{\sigma^2}$$
If I increase the sample to $n$ IID values, I am looking at the variance of $$\sum \frac{x_i - \mu}{\sigma^2}$$
The bigger my sample, the larger the variance of the score, while of course the variance of the MLE is shrinking. Meanwhile the Hessian is growing more negative as I add more terms. Think of the score function as a random walk whose variance grows with the sample size. At the same time, as the sample size grows, my knowledge of $\theta$ is also growing. But knowledge of $\theta$ is captured in the variance of the MLE, which is the reciprocal of the information matrix (the LHS).
In the normal case, it is trivial to show that the RHS equals the LHS. From there, it makes sense that the two sides should tend to equality when the samples get large. I still find it amazing that the equality is not asymptotic, but that it holds always and everywhere. However, you have seen the proof. I'm just trying to get at the intuition.
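As a quick numerical check of this intuition (my own sketch, normal model with known variance; the sample sizes and seed are arbitrary): the variance of the score grows like $n/\sigma^2$ while the variance of the MLE $\bar{x}$ shrinks like $\sigma^2/n$, and the two sides of the identity agree at every $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5          # true mean and known standard deviation
reps = 20_000                 # number of simulated samples per sample size

for n in (5, 50, 500):
    x = rng.normal(mu, sigma, size=(reps, n))
    score = np.sum((x - mu) / sigma**2, axis=1)   # s(x, mu) at the true mu
    mle = x.mean(axis=1)                          # MLE of mu
    info = n / sigma**2                           # -E[Hessian] (the LHS)
    print(n,
          round(score.var(), 2), round(info, 2),      # var of score vs LHS
          round(mle.var(), 4), round(1 / info, 4))    # var of MLE vs 1/LHS
```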
Later
Still trying to wrap my head around this, I went back to the actual proof of the identity, hoping that it would clarify why the result works. The proof depends totally on the fact that the score function is the derivative of the log likelihood - and happy cancellations occur when evaluating the integrals. Also relevant is the fact that a density function tends to 0 at both plus and minus infinity. There isn't normally a nice connection between the variance of a random variable and its expected derivative with respect to the parameter of the underlying distributional family. It's all part of the magic of logarithms and $$\frac{d \log(f(x))}{dx} = \frac{f'(x)}{f(x)}$$
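For reference, here is that cancellation written out for the one-parameter case (my own summary of the standard argument), assuming the regularity conditions that let us differentiate under the integral sign. With $s(x,\theta) = \frac{\partial}{\partial\theta}\log f_\theta(x)$,

$$\frac{\partial^2}{\partial\theta^2}\log f_\theta(x) = \frac{\frac{\partial^2}{\partial\theta^2} f_\theta(x)}{f_\theta(x)} - \left(\frac{\frac{\partial}{\partial\theta} f_\theta(x)}{f_\theta(x)}\right)^2 = \frac{\frac{\partial^2}{\partial\theta^2} f_\theta(x)}{f_\theta(x)} - s(x,\theta)^2.$$

Taking expectations, the first term integrates to $\frac{\partial^2}{\partial\theta^2}\int f_\theta(x)\,dx = 0$, and the same trick applied once shows $E_\theta[s(X,\theta)] = 0$. What remains is exactly the identity:

$$-E_\theta\Big[\frac{\partial^2}{\partial\theta^2}\log f_\theta(X)\Big] = E_\theta\big[s(X,\theta)^2\big] = var_\theta\big(s(X,\theta)\big).$$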
I don't think your question has an answer. Or rather, the questions should be "Why is the log likelihood a good thing?" and "Why is the Hessian called an information matrix?"; and the answer to those questions is the information identity.
Regarding @singlepeaked's comparison to the OLS variance, remember that the OLS estimates of the coefficients, when $\sigma$ is known and the errors are normal, are also the maximum likelihood estimates. The variance-covariance matrix of those estimates is an example of the information identity at work.
Best Answer
You should first check out this related question: Basic question about Fisher Information matrix and relationship to Hessian and standard errors.
Suppose we have a statistical model (family of distributions) $\{f_{\theta}: \theta \in \Theta\}$. In the most general case we have $\dim(\Theta) = d$, so this family is parameterized by $\theta = (\theta_1, \dots, \theta_d)^T$. Under certain regularity conditions, we have
$$I_{i,j}(\theta) = -E_{\theta}\Big[\frac{\partial^2 l(X; \theta)}{\partial\theta_i\partial\theta_j}\Big] = -E_\theta\Big[H_{i,j}(l(X;\theta))\Big]$$
where $I_{i,j}$ is the $(i,j)$ entry of the Fisher information matrix (as a function of $\theta$), $X$ is the observed value (the sample), and
$$l(X; \theta) = \ln(f_{\theta}(X)),\text{ for some } \theta \in \Theta$$
So the Fisher information matrix is the negated expected value of the Hessian of the log-likelihood under $\theta$.
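A quick Monte Carlo sketch of this relationship (my own example, not from the linked question): for a normal model with unknown mean $\mu$ and variance $v$, the per-observation Hessian of the log-density has a closed form, and averaging it over simulated data reproduces the known Fisher information matrix $diag(1/v,\ 1/(2v^2))$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, v = 1.0, 4.0                         # true parameters, theta = (mu, v)
x = rng.normal(mu, np.sqrt(v), size=200_000)

# Analytic per-observation Hessian of l(x; mu, v) = log f(x), averaged
# over the simulated sample to approximate E_theta[H].
H = np.array([
    [np.mean(-np.ones_like(x) / v), np.mean(-(x - mu) / v**2)],
    [np.mean(-(x - mu) / v**2),     np.mean(1 / (2 * v**2) - (x - mu)**2 / v**3)],
])

fisher_mc = -H                                       # I(theta) = -E[Hessian]
fisher_exact = np.diag([1 / v, 1 / (2 * v**2)])      # known closed form
print(np.round(fisher_mc, 4))
print(fisher_exact)
```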
Now let's say we want to estimate some vector function of the unknown parameter, $\psi(\theta)$. Usually we want the estimator $T(X) = (T_1(X), \dots, T_d(X))$ to be unbiased, i.e.
$$\forall_{\theta \in \Theta}\ E_{\theta}[T(X)] = \psi(\theta)$$
The Cramér-Rao lower bound states that for every unbiased $T(X)$, the covariance matrix $cov_{\theta}(T(X))$ satisfies
$$cov_{\theta}(T(X)) \ge \frac{\partial\psi(\theta)}{\partial\theta}I^{-1}(\theta)\Big(\frac{\partial\psi(\theta)}{\partial\theta}\Big)^T = B(\theta)$$
where $A \ge B$ for matrices means that $A - B$ is positive semi-definite, and $\frac{\partial\psi(\theta)}{\partial\theta}$ is simply the Jacobian with entries $J_{i,j}(\psi)$. Note that if we estimate $\theta$ itself, that is $\psi(\theta) = \theta$, the above simplifies to
$$cov_{\theta}(T(X)) \ge I^{-1}(\theta)$$
But what does it tell us really? For example, recall that
$$var_{\theta}(T_i(X)) = [cov_{\theta}(T(X))]_{i,i}$$
and that for every positive semi-definite matrix $A$ the diagonal elements are non-negative:
$$\forall_i\ A_{i,i} \ge 0$$
From the above we can conclude that the variance of each component of the estimator is bounded below by the corresponding diagonal element of the matrix $B(\theta)$:
$$\forall_i\ var_{\theta}(T_i(X)) \ge [B(\theta)]_{i,i}$$
So the CRLB doesn't tell us the variance of our estimator, but whether or not our estimator is optimal, i.e. whether it has the lowest covariance among all unbiased estimators.
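As a small illustration of that last point (my own example): for a normal model with known $\sigma$ and $\psi(\theta) = \mu$, the sample mean is unbiased with variance $\sigma^2/n$, which is exactly $I^{-1}(\mu)$, so the bound is attained and the sample mean is optimal among unbiased estimators.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 2.0, 30, 50_000

x = rng.normal(mu, sigma, size=(reps, n))
t = x.mean(axis=1)              # unbiased estimator T(X) = sample mean

crlb = sigma**2 / n             # I^{-1}(mu) for a normal with known sigma
print(round(t.var(), 4), round(crlb, 4))   # empirical variance attains the bound
```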