Solved – How many samples are needed to estimate a p-dimensional covariance matrix

covariance, mathematical-statistics

In general, how many points are needed to estimate a $p$-dimensional covariance matrix? Does it depend on how the data are spread out across the different dimensions? Does it depend on the true distribution of the data? Thank you!

Best Answer

As @whuber has commented (+1 to his comment), you need "$p$ data points in $p$ dimensions" under the assumption that those points are independent. Nevertheless, as you correctly recognise, this will depend on the underlying distribution of your data as well as on your sample size. That is because with finite sample sizes you get finite-sample effects: roughly speaking, your random sample exhibits properties (usually regularities in the form of collinearity) that should not be there. This is nothing new; finite- or small-sample corrections are applied ubiquitously in Statistics. For example, the very popular Akaike Information Criterion (AIC) has an easy-to-compute version corrected for finite samples, the AICc (which is unfortunately underused; note that AICc corrects for small-sample bias, not for collinearity). So how bad might things be?
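For reference, the correction itself is a one-liner. For a model with $k$ parameters fitted to $n$ data points,

$$\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n - k - 1},$$

which converges to the plain AIC as $n \to \infty$.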

[A quick note: A singular covariance matrix is one that is positive semi-definite but not positive definite (PD). You can check whether a matrix is PD by checking whether it has a Cholesky decomposition; that is much faster than using an eigendecomposition or the SVD.]
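A minimal sketch of that check in MATLAB (K here is just an arbitrary well-conditioned example):

K = cov(randn(10,3));    % sample covariance of 10 points in 3 dimensions
[~,p] = chol(K,'lower'); % second output is 0 iff the factorization succeeded
isPD = (p == 0)          % true here: K is positive definite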

Let's say we are looking at a random sample $S_1$ with entries $s_1 \sim U[0,1]$ and another sample $S_2$ with entries $s_2 \sim T_{\nu=1}$ ($\nu$ being the degrees of freedom). How would things go with a relatively small ($<200$) sample? Well... not splendidly! Let's simulate something like this (in MATLAB):

rng(1234)            % for reproducibility
N = 200;             % replicates per dimension
M = 200;             % maximum dimension p
FailsU = zeros(N,M); % Cholesky failures, uniform sample
FailsT = zeros(N,M); % Cholesky failures, Student-t sample

for i = 1:M          % i is the dimension p
    for j = 1:N      % j indexes the replicate
        K = cov(rand(i));                 % p points in p dimensions, U[0,1]
        [~,p] = chol(K,'lower');          % second output p > 0 means K is not PD
        if(p)
            FailsU(j,i) = 1;
        end
        K = cov(random('t',1,[i,i]));     % p points in p dimensions, t with nu = 1
        [~,p] = chol(K,'lower');
        if(p)
            FailsT(j,i) = 1;
        end
    end
end

[Figure: proportion of non-PD sample covariance matrices as a function of the dimension $p$ with $N = p$ points, for the uniform and Student-t samples]
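A sketch of how such a figure could be generated from the simulation above (plot styling is hypothetical):

figure;
plot(1:M, mean(FailsU), 1:M, mean(FailsT));
xlabel('dimension p');
ylabel('proportion of non-PD covariance matrices');
legend('U[0,1] sample', 't_{\nu=1} sample');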

About 50% of the time you will get a non-PD matrix. That's definitely not good. What if we had $p+1$ samples, though?

% Same setup as above, but now with p+1 points in p dimensions
FailsU1 = zeros(N,M);
FailsT1 = zeros(N,M);

for i = 1:M
    for j = 1:N
        K = cov(rand(i+1,i));             % p+1 points in p dimensions, U[0,1]
        [~,p] = chol(K,'lower');
        if(p)
            FailsU1(j,i) = 1;
        end
        K = cov(random('t',1,[i+1,i]));   % p+1 points in p dimensions, t with nu = 1
        [~,p] = chol(K,'lower');
        if(p)
            FailsT1(j,i) = 1;
        end
    end
end

[Figure: proportion of non-PD sample covariance matrices as a function of $p$ with $N = p+1$ points; the failures essentially disappear]

Yep, we are fine! But why all this trouble?

Let's see the $p \times p$ version first: when you say that a matrix is non-invertible or has zero eigenvalues, you are saying it is rank deficient. That is because the rank of a covariance matrix can be thought of as the number of its non-zero eigenvalues. In addition, when you compute the product $S_x^T S_x$ you get a matrix of rank at most $p$, because the rank of a product of matrices is at most the rank of any matrix in the product. That means the covariance matrix of your $p$-dimensional sample has rank at most $p$; more precisely, for any number of samples $N$, the rank of $S_x$ is at most $\min(N, p)$. With exactly $N = p$ samples there is no slack: any linear dependence among the (centered) points makes the covariance matrix rank deficient, as seen in the first simulation. OK, fine, but why does $p+1$ seem to work so nicely?
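A quick check of that rank deficiency (the mechanism behind the lost rank is explained next):

p = 5;
X = rand(p, p);  % exactly p points in p dimensions
rank(cov(X))     % returns 4: one rank is lost to centering, so cov(X) is singular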

Now let's see the $(p+1) \times p$ version: think of how you estimate a sample covariance matrix. In a quick manner we can write $K = \frac{1}{N-1} S_x^T S_x$ (because we assumed $S_x$ to have mean 0); properly, we should write $K = \frac{1}{N-1} \sum_{i=1}^N (S_{x(i)} - \hat{\mu})(S_{x(i)} - \hat{\mu})^T$, where $\hat{\mu}$ is the sample mean. But what is the rank of each $(S_{x(i)} - \hat{\mu})(S_{x(i)} - \hat{\mu})^T$ matrix? Well... it is 1! Not only that, but precisely because we subtracted $\hat{\mu}$, we reduced the rank of the data matrix $S_x$ by one to begin with. So as we add up points ($N$ gets larger) we have more chances of reaching a full-rank matrix. In the current case, precisely because the true covariance is diagonal (and therefore full rank), adding just one more point made a rank deficiency extremely unlikely (though not guaranteed).
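As a quick sanity check (sizes here are arbitrary), the sum of rank-one terms matches MATLAB's cov up to floating-point error:

N = 7; p = 3;
X = rand(N, p);
mu = mean(X);           % sample mean (1-by-p)
K = zeros(p);
for i = 1:N
    d = X(i,:) - mu;    % centered point
    K = K + d' * d;     % rank-one update
end
K = K / (N - 1);
norm(K - cov(X))        % ~0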
