I used the nlm function in R to do the optimization. When I calculated the correlation between the estimated parameters using the inverse of the Hessian matrix, I got negative values on the diagonal. My questions are: what could be the reason for this issue, and what could be a possible solution?

# Solved – Negative variance from inverse Hessian matrix

covariance-matrix

#### Related Solutions

The operation $C^{-\frac{1}{2}}$ refers to the *decorrelation* of the underlying sample into uncorrelated components; $C^{-\frac{1}{2}}$ is used as a whitening matrix. This is a natural operation when looking to analyse each column/source of the original data matrix $A$ (having covariance matrix $C$) through an uncorrelated matrix $Z$. The most common way of implementing such whitening is through the Cholesky decomposition (where we use $C = LL^T$; see this thread for an example of "colouring" a sample), but here we use the slightly less common Mahalanobis whitening (where we use $C = C^{0.5} C^{0.5}$).
The whole operation in R would go a bit like this:

```
set.seed(323)
N <- 10000
p <- 3
# Define the real C
( C <- base::matrix( data = c(4, 2, 1, 2, 3, 2, 1, 2, 3), ncol = p, byrow = TRUE) )
# Generate the uncorrelated data (ground truth)
Z <- base::matrix( rnorm(N * p), ncol = p )
# Estimate the colouring matrix C^0.5
CSqrt <- expm::sqrtm(C)
# "Colour" the data / usually we use Cholesky (LL^T) but using C^0.5 is valid too
A <- t( CSqrt %*% t(Z) )
# Get the sample estimate of C
( CEst <- round( digits = 2, cov(A) ) )
# Estimate the whitening matrix C^-0.5
CEstInv <- expm::sqrtm(solve(CEst))
# Whiten the data
ZEst <- t( CEstInv %*% t(A) )
# Check that we have indeed whitened the data
( round( digits = 1, cov(cbind(ZEst, Z)) ) )
```

So to succinctly answer the question raised:

- It means that we can decorrelate the sample $A$ that is associated with that covariance matrix $C$ in such a way that we get uncorrelated components. This is commonly referred to as whitening.
- The general linear-algebra idea it assumes is that a (covariance) matrix can be used as a projection operator (to generate a correlated sample by "colouring"), and so can its inverse (to decorrelate/"whiten" a sample).
- Yes, the easiest way to raise a valid covariance matrix to any power (the negative square root is just a special case) is by using its eigendecomposition: $C = V \Lambda V^T$, with $V$ an orthonormal matrix holding the eigenvectors of $C$ and $\Lambda$ a diagonal matrix holding the eigenvalues. We can then readily change the diagonal matrix $\Lambda$ as we wish and get the relevant result.

A small code snippet showcasing the third point:

```
# Get the eigendecomposition of the covariance matrix
myEigDec <- eigen(cov(A))
# Use the eigendecomposition to get the inverse square root
myEigDec$vectors %*% diag( 1 / sqrt(myEigDec$values) ) %*% t(myEigDec$vectors)
# Use the eigendecomposition to get the "negative half power" (same as above)
myEigDec$vectors %*% diag( myEigDec$values^(-0.5) ) %*% t(myEigDec$vectors)
# And to confirm with the R package expm
solve(expm::sqrtm(cov(A)))
```

The result that you seem to be referring to relates the inverse of the Fisher information matrix to the *asymptotic* covariance matrix of the *maximum likelihood estimator* (MLE). Note that this is an *asymptotic* result, i.e. it applies only in the limit of infinite sample size; it is *not* a general property. More specifically, under some regularity conditions (which don't always hold), the distribution of the MLE converges in the limit of infinite sample size to a multivariate normal distribution with covariance matrix $I(\theta)^{-1}$.
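Spelled out a little more explicitly (standard notation, not taken from the original thread): with $I(\theta)$ the Fisher information of the full sample, so that $I(\theta) = n\,I_1(\theta)$ for $n$ i.i.d. observations with per-observation information $I_1(\theta)$,

$$\sqrt{n}\,\bigl(\hat{\theta}_n - \theta_0\bigr) \;\xrightarrow{d}\; \mathcal{N}\bigl(0,\; I_1(\theta_0)^{-1}\bigr),$$

so for large $n$ the MLE is approximately $\mathcal{N}\bigl(\theta_0,\, I(\theta_0)^{-1}\bigr)$. In practice $I(\theta_0)$ is unknown and is usually replaced by the observed information, i.e. the negative Hessian of the log-likelihood evaluated at $\hat{\theta}_n$, which is exactly what the inverse-Hessian approach in the question relies on.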

This implies that the MLE is asymptotically *efficient*, i.e. it attains the lowest variance possible for any unbiased estimator (see the Cramér-Rao bound and maximum likelihood efficiency).
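For a scalar parameter, the bound in question (stated here in its standard form, not taken from the original answer) is

$$\operatorname{Var}\bigl(\hat{\theta}\bigr) \;\ge\; \frac{1}{I(\theta)}$$

for any unbiased estimator $\hat{\theta}$; the multivariate version says that $\operatorname{Cov}(\hat{\theta}) - I(\theta)^{-1}$ is positive semi-definite. The MLE attains this bound asymptotically.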

You can find the details of the proof in most classic statistics textbooks (like this one), as well as in many lecture notes (search for "maximum likelihood asymptotic efficiency").

## Best Answer

The variance-covariance matrix can be approximated by the inverse of the negative Hessian $H$ (the matrix of second-order partial derivatives of the log-likelihood). Maybe the reason is that you are approximating it using the inverse of $H$, when it should be the inverse of the negative $H$, i.e. $(-H)^{-1}$.
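
Sticking with nlm as in the question, here is a minimal sketch (with a hypothetical normal-sample model, not the asker's actual likelihood) of where the sign matters. Because nlm minimises, handing it the negative log-likelihood means the returned hessian is already $-H$, and inverting it directly gives positive variances; inverting the Hessian of the log-likelihood itself produces the negative diagonal described in the question.

```
# Hypothetical example: fit mean and log-sd of a normal sample with nlm()
set.seed(1)
x <- rnorm(500, mean = 2, sd = 3)

# Negative log-likelihood (nlm minimises, so we hand it the *negative*)
negLogLik <- function(par) {
  mu    <- par[1]
  sigma <- exp(par[2])   # optimise log(sd) so sigma stays positive
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

fit <- nlm(negLogLik, p = c(0, 0), hessian = TRUE)

# fit$hessian is the Hessian of the negative log-likelihood, i.e. -H,
# so its inverse is already the approximate covariance matrix
( covOK  <- solve(fit$hessian) )    # positive diagonal (variances)

# Inverting the Hessian of the log-likelihood itself (H = -fit$hessian)
# gives the negative diagonal reported in the question
( covBad <- solve(-fit$hessian) )   # negative "variances"
```

Taking sqrt(diag(covOK)) then gives the usual approximate standard errors, and cov2cor(covOK) the parameter correlations the question was after.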