Solved – Estimation of the covariance matrix

covariance, elastic net, estimation, estimators

Assume we have $n$ iid random vectors $y_1, y_2, …, y_n$, normally distributed with zero mean and unknown covariance matrix $M$. Each vector is of size $p$.
I know of many methods that provide a sparse estimate of the true, unknown covariance matrix.

My question: Consider the case when $p\gg n$. Is it possible to obtain an estimated covariance matrix that is better than the true one?
By "better" I mean that I can use the new covariance estimate instead of the true one (assume it is known now) in specific applications such as hyperspectral image detection, classification, etc., and get better results.

For example: Take the article by Ledoit and Wolf entitled "A well-conditioned estimator for large-dimensional covariance matrices". They develop a new estimator that is a weighted average of the sample covariance matrix and the identity matrix, and they state in the article (page 2) that the new estimator is more accurate than either of them. So if we hypothesize that the true covariance matrix, which is unknown, is in fact the identity, we can expect the Ledoit–Wolf estimator to be better than the true one.
Is what I am assuming not logical?
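To make this concrete, here is a rough sketch in Python/NumPy of the convex-combination idea. The dimensions and the shrinkage intensity `rho` are purely illustrative values, not the data-driven optimal intensity that Ledoit and Wolf derive:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "p >> n" setting: p variables, n observations, true covariance = identity.
p, n = 200, 20
y = rng.standard_normal((n, p))

# Sample covariance matrix (the data are zero-mean by construction).
S = y.T @ y / n

# Shrinkage target: a scaled identity with the same average variance as S.
mu = np.trace(S) / p
target = mu * np.eye(p)

# Convex combination of the target and the sample covariance.
# rho = 0.5 is only illustrative; Ledoit and Wolf estimate the optimal intensity from the data.
rho = 0.5
Sigma_shrunk = rho * target + (1 - rho) * S

print(np.linalg.matrix_rank(S))             # at most n = 20: S is singular
print(np.linalg.matrix_rank(Sigma_shrunk))  # p = 200: the shrunk estimate is full rank
```

(For the data-driven choice of the intensity, scikit-learn provides an implementation in `sklearn.covariance.LedoitWolf`.)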

Any help will be very much appreciated!

Best Answer

What you are describing sounds to me like ridge regression or Tikhonov regularization: you add a ridge, i.e. a scaled identity matrix, to the diagonal.
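In terms of the covariance matrix itself, a minimal sketch (assuming a NumPy array `S` holding the $p\times p$ sample covariance and a hypothetical ridge parameter `lam`) looks like this:

```python
import numpy as np

def add_ridge(S, lam):
    """Return S + lam * I: a 'ridge' (scaled identity) added to the diagonal of S."""
    return S + lam * np.eye(S.shape[0])
```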

The problem is that when you have more variables than observations, i.e. $p \gg n$, you cannot estimate the parameters in some models, e.g. a linear model. Suppose you have the model $$ \mathbf{y} = \mathbf{X}\beta + \epsilon, $$ where $\mathbf{y}$ is $n\times 1$ and $\mathbf{X}$ is $n\times p$. The least-squares estimate of $\beta$ is of the form $$ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}. $$

Note that the matrix $\mathbf{X}^T \mathbf{X}$ is rank deficient when $p \gg n$ (and here $\sigma^2(\mathbf{X}^T \mathbf{X})^{-1}$ is the covariance matrix of the parameter estimates $\hat{\beta}$). Since computing the estimate requires inverting this matrix, we need to add some form of regularization to obtain a solution at all; the one you mention is one such type.
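A small sketch of this (Python/NumPy, with hypothetical dimensions and a hypothetical ridge parameter `lam`) shows that the normal-equations matrix is singular when $p \gg n$, while the ridge-regularized version can be inverted:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 200                         # more variables than observations
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))      # at most n = 20, so the p x p matrix XtX is singular

# Ordinary least squares fails because XtX cannot be inverted.
# Ridge / Tikhonov regularization adds a scaled identity before solving.
lam = 1.0                              # hypothetical ridge parameter
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)                # (200,)
```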

So this estimate is "better" in the sense that we cannot obtain any estimate at all unless we throw away some of the variables or add some form of regularization.

EDIT: To address your question of whether the estimate presented in the paper is better than the true covariance matrix, you should read over the conclusions of the paper:

In this paper, we have discussed the estimation of large-dimensional covariance matrices where the number of (iid) variables is not small compared to the sample size. It is well-known that in such situations the usual estimator, the sample covariance matrix, is ill-conditioned and may not even be invertible. The approach suggested is to shrink the sample covariance matrix towards the identity matrix, which means to consider a convex linear combination of these two matrices. The practical problem is to determine the shrinkage intensity, that is, the amount of shrinkage of the sample covariance matrix towards the identity matrix. To solve this problem, we considered a general asymptotics framework where the number of variables is allowed to tend to infinity with the sample size. It was seen that under mild conditions the optimal shrinkage intensity then tends to a limiting constant; here, optimality is meant with respect to a quadratic loss function based on the Frobenius norm. It was shown that the asymptotically optimal shrinkage intensity can be estimated consistently, which leads to a feasible estimator. Both the asymptotic results and the extensive Monte-Carlo simulations presented in this paper indicate that the suggested shrinkage estimator can serve as an all-purpose alternative to the sample covariance matrix. It has smaller risk and is better-conditioned. This is especially true when the dimension of the covariance matrix is large compared to the sample size.

Thus, the estimate they provide is being compared to the sample covariance estimate, not to the true underlying covariance matrix.

EDIT2: The way the authors describe this as "better" (on page 3 of the manuscript) refers to the condition number of the matrix; that is, their estimate is more numerically stable. This is usually the case when you perform any kind of regularization, since you are reducing the effective number of parameters that you are estimating.
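As a brief illustration (a NumPy sketch with hypothetical dimensions and a fixed, illustrative shrinkage intensity), comparing condition numbers makes the point:

```python
import numpy as np

rng = np.random.default_rng(1)

p, n = 100, 30
Y = rng.standard_normal((n, p))
S = Y.T @ Y / n                            # sample covariance, singular since p > n

mu = np.trace(S) / p
S_shrunk = 0.5 * mu * np.eye(p) + 0.5 * S  # shrink halfway towards the scaled identity

print(np.linalg.cond(S))                   # huge (numerically infinite)
print(np.linalg.cond(S_shrunk))            # moderate: the shrunk estimate is well-conditioned
```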
