In general, how many points are needed to estimate a p-dimensional covariance matrix? Does it depend on how the data are spread out across the different dimensions? Does it depend on the true distribution of the data? Thank you!
Solved – How many samples are needed to estimate a p-dimensional covariance matrix
covariance, mathematical-statistics
Related Solutions
(The answer below merely introduces and states the theorem proven in [0]. The beauty of that paper is that most of the arguments are made in terms of basic linear algebra. To answer this question it will be enough to state the main results but, by all means, go check the original source.)
In any situation where the multivariate pattern of the data can be described by a $k$-variate elliptical distribution, statistical inference will, by definition, reduce to the problem of fitting (and characterizing) a $k$-variate location vector (say $\boldsymbol\theta$) and a $k\times k$ symmetric positive semi-definite (SPSD) matrix (say $\boldsymbol\varSigma$). For reasons explained below (and assumed here as premises), it is often more meaningful to decompose $\boldsymbol\varSigma$ into a shape component (an SPSD matrix of the same size as $\boldsymbol\varSigma$) accounting for the shape of the density contours of the multivariate distribution, and a scalar $\sigma_S$ expressing the scale of these contours.
For univariate data ($k=1$), the covariance matrix $\boldsymbol\varSigma$ is a scalar and, as will follow from the discussion below, its shape component is 1, so that $\boldsymbol\varSigma$ always equals its scale component, $\boldsymbol\varSigma=\sigma_S$, and no ambiguity is possible.
In multivariate data, there are many possible choices for the scaling function $\sigma_S$. One in particular ($\sigma_S=|\pmb\varSigma|^{1/k}$) stands out in having a key desirable property, making it the preferred choice of scaling function in the context of elliptical families.
Many problems in MV-statistics involve the estimation of a scatter matrix, defined as an SPSD matrix-valued functional $\boldsymbol\varSigma$ in $\mathbb{R}^{k\times k}$ satisfying:
$$(0)\quad\boldsymbol\varSigma(\boldsymbol A\boldsymbol X+\boldsymbol b)=\boldsymbol A\boldsymbol\varSigma(\boldsymbol X)\boldsymbol A^\top$$ (for nonsingular matrices $\boldsymbol A$ and vectors $\boldsymbol b$). For example, the classical covariance estimate satisfies (0), but it is by no means the only one.
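As a quick numerical illustration (a minimal sketch; the sizes and random data here are assumptions, not part of the original argument), the sample covariance satisfies (0) exactly:

```matlab
% Sketch: the sample covariance is affine equivariant, i.e. satisfies (0).
k = 3;  n = 500;
X = randn(n, k);                 % n observations in k dimensions
A = randn(k);                    % nonsingular almost surely
b = randn(1, k);
Y = X * A' + b;                  % each row x is mapped to A*x + b
norm(cov(Y) - A * cov(X) * A')   % ~1e-15: equality up to rounding
```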
With elliptically distributed data, where all density contours are ellipsoids defined by the same shape matrix up to multiplication by a scalar, it is natural to consider normalized versions of $\boldsymbol\varSigma$ of the form:
$$\boldsymbol V_S = \boldsymbol\varSigma / S(\boldsymbol\varSigma)$$
where $S$ is a 1-homogeneous function satisfying:
$$(1)\quad S(\lambda \boldsymbol\varSigma)=\lambda S(\boldsymbol\varSigma) $$
for all $\lambda>0$. Then $\boldsymbol V_S$ is called the shape component of the scatter matrix (in short, the shape matrix) and $\sigma_S=S^{1/2}(\boldsymbol\varSigma)$ is called its scale component. Examples of multivariate estimation problems where the loss function depends on $\boldsymbol\varSigma$ only through its shape component $\boldsymbol V_S$ include tests of sphericity, PCA and CCA, among others.
Of course, there are many possible scaling functions, so this still leaves open the question of which (if any) of the several choices of normalization function $S$ is in some sense optimal. For example:
- $S=\text{tr}(\boldsymbol\varSigma)/k$ (the normalization proposed by @amoeba in his comment below the OP's question, as well as in @HelloGoodbye's answer below; see also [1], [2], [3])
- $S=|\boldsymbol\varSigma|^{1/k}$ ([4], [5], [6], [7], [8])
- $S=\boldsymbol\varSigma_{11}$ (the first entry of the covariance matrix)
- $S=\lambda_1(\boldsymbol\varSigma)$ (the largest eigenvalue of $\boldsymbol\varSigma$, i.e. its spectral norm), discussed in @Aksakal's answer below.
Among these, $S=|\boldsymbol\varSigma|^{1/k}$ is the only scaling function for which the Fisher information matrix for the corresponding estimates of scale and shape, in locally asymptotically normal families, is block diagonal (that is, the scale and shape components of the estimation problem are asymptotically orthogonal) [0]. This means, among other things, that $S=|\boldsymbol\varSigma|^{1/k}$ is the only choice of $S$ for which leaving $\sigma_S$ unspecified causes no loss of efficiency when performing inference on $\boldsymbol V_S$.
I do not know of any comparably strong optimality characterization for any of the many possible choices of $S$ that satisfy (1).
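To make the candidate normalizations above concrete, here is a minimal numerical sketch in MATLAB (matching the language of the answer further down; the dimension $k$ and the generated data are assumptions of this sketch, not from the original):

```matlab
% Sketch of the four scale functionals listed above.
k     = 5;
X     = randn(200, k) * diag(1:k);    % some k-variate sample (assumption)
Sigma = cov(X);

S_tr   = trace(Sigma) / k;            % S = tr(Sigma)/k
S_det  = det(Sigma)^(1/k);            % S = |Sigma|^(1/k), the canonical choice of [0]
S_11   = Sigma(1, 1);                 % S = Sigma_11
S_spec = max(eig(Sigma));             % S = lambda_1(Sigma), the spectral norm

% Each satisfies (1); the shape matrix for the canonical choice:
V = Sigma / S_det;                    % det(V) == 1 up to rounding error
```

Note that with the canonical choice the shape matrix is normalized to have unit determinant, so the scale $\sigma_S$ carries all the volume information of the density contours.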
- [0] Paindaveine, D. (2008). A canonical definition of shape, Statist. Probab. Lett. 78, 2240–2247.
- [1] Dümbgen, L. (1998). On Tyler's M-functional of scatter in high dimension, Ann. Inst. Statist. Math. 50, 471–491.
- [2] Ollila, E., T.P. Hettmansperger, and H. Oja (2004). Affine equivariant multivariate sign methods. Preprint, University of Jyväskylä.
- [3] Tyler, D.E. (1983). Robustness and efficiency properties of scatter matrices, Biometrika 70, 411–420.
- [4] Dümbgen, L., and D.E. Tyler (2005). On the breakdown properties of some multivariate M-functionals, Scand. J. Statist. 32, 247–264.
- [5] Hallin, M. and D. Paindaveine (2008). Optimal rank-based tests for homogeneity of scatter, Ann. Statist., to appear.
- [6] Salibian-Barrera, M., S. Van Aelst, and G. Willems (2006). Principal components analysis based on multivariate MM-estimators with fast and robust bootstrap, J. Amer. Statist. Assoc. 101, 1198–1211.
- [7] Taskinen, S., C. Croux, A. Kankainen, E. Ollila, and H. Oja (2006). Influence functions and efficiencies of the canonical correlation and vector estimates based on scatter and shape matrices, J. Multivariate Anal. 97, 359–384.
- [8] Tatsuoka, K.S., and D.E. Tyler (2000). On the uniqueness of S-functionals and M-functionals under nonelliptical distributions, Ann. Statist. 28, 1219–1243.
Best Answer
As @whuber has commented (+1 to his comment), you need “$p$ data points in $p$ dimensions”, under the assumption that those points are independent. Nevertheless, as you correctly recognise, this will depend on the underlying distribution of your data as well as on your sample size. That is because with finite sample sizes you get finite-sample effects: roughly speaking, your random sample exhibits properties (usually regularities in the form of collinearity) that should not be there. This is nothing new; finite- or small-sample corrections are done ubiquitously in statistics. For example, the very popular Akaike Information Criterion (AIC) has an easy-to-compute version corrected for finite samples, the AICc (which is unfortunately underused; note that AICc corrects for deviations from normality, not for collinearity). So how bad might things be?
[A quick note: a singular covariance matrix is one that is not positive definite (PD); since a covariance matrix is always positive semi-definite, singularity and lack of positive definiteness coincide. You can check whether a matrix is PD by checking whether it has a Cholesky decomposition, which is much faster than using an eigendecomposition or SVD.]
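For instance, a two-line sketch of that check:

```matlab
% Positive-definiteness check via Cholesky: flag == 0 iff the matrix is PD.
K = [1 1; 1 1];            % rank-1 PSD matrix: singular, hence not PD
[~, flag] = chol(K);       % flag > 0 signals "not PD", no error is thrown
```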
Let's say we are looking at a random sample $S_1$ such that $s_1 \sim U[0,1]$ and another sample $S_2$ such that $s_2 \sim T_{\nu=1}$ ($\nu$ being the degrees of freedom). How would things go with a relatively small ($<200$) sample? Well... not splendidly! Let's simulate something like this (in MATLAB):
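A minimal sketch of such a simulation (the dimension $p = 100$, the 1,000 repetitions and the seed are assumptions of this sketch; `trnd` requires the Statistics Toolbox):

```matlab
% N = p points, half the coordinates U[0,1] and half t(1).
rng(1234);
p     = 100;
nSims = 1000;
nonPD = 0;
for i = 1:nSims
    X = [rand(p, p/2), trnd(1, p, p/2)];   % N = p points in p dimensions
    [~, flag] = chol(cov(X));              % flag > 0  <=>  not PD
    nonPD = nonPD + (flag > 0);
end
fprintf('Proportion of non-PD covariance matrices: %.2f\n', nonPD / nSims);
% roughly 0.5, i.e. the ~50% figure quoted below
```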
About 50% of the time you will get a non-PD matrix. That's definitely not good. What if we had $p+1$ samples, though?
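The same sketch with one extra point per replication:

```matlab
% Same experiment with N = p + 1 points.
rng(1234);
p     = 100;
nSims = 1000;
nonPD = 0;
for i = 1:nSims
    X = [rand(p + 1, p/2), trnd(1, p + 1, p/2)];  % one extra point
    [~, flag] = chol(cov(X));
    nonPD = nonPD + (flag > 0);
end
fprintf('Proportion of non-PD covariance matrices: %.2f\n', nonPD / nSims);
% essentially 0
```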
Yep, we are fine! Why all this trouble, though?
Let's see the $p \times p$ version first. When you say that a matrix is non-invertible or has zero eigenvalues, you are saying it is rank deficient; the rank of a covariance matrix can be thought of as the number of its non-zero eigenvalues. When you compute the product $S_x^\top S_x$ you can get a matrix of rank at most $p$, because the rank of a product of matrices is at most the rank of any of its factors. So the covariance matrix of your $p$-dimensional sample has rank at most $p$, with no rank to spare when $N = p$; even small deviations from complete randomness (finite-sample regularities such as near-collinearity) then push the covariance matrix into rank deficiency, as seen in the first simulation. OK, fine, but why does $p+1$ seem to work so nicely?
Now let's see the $(p+1) \times p$ version. Think of how you estimate a sample covariance matrix. As a shorthand we can write $K = \frac{1}{N-1} S_x^\top S_x$ (assuming $S_x$ has mean 0); properly we should write $K = \frac{1}{N-1} \sum_{i=1}^N (S_{x(i)} - \hat{\mu})(S_{x(i)} - \hat{\mu})^\top$, where $\hat{\mu}$ is the sample mean. But what is the rank of each $(S_{x(i)} - \hat{\mu})(S_{x(i)} - \hat{\mu})^\top$ matrix? Well... it is 1! Not only that, but exactly because we subtracted $\hat{\mu}$, we reduced the rank available to us by one: the sum of $N$ such rank-one terms has rank at most $N-1$. So as we add points ($N$ gets larger) we have more chances of getting a full-rank matrix. In the current case, where the true covariance of our sample is diagonal, adding just one more point ($N = p+1$) made a rank deficiency extremely unlikely (though not guaranteed).
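A small sketch of this rank bookkeeping (the toy sizes are assumptions):

```matlab
% The sample covariance as a sum of N rank-one outer products.
N = 12;  p = 10;                          % toy sizes (assumption)
X  = trnd(1, N, p);
mu = mean(X);                             % 1-by-p sample mean
K  = zeros(p);
for i = 1:N
    d = (X(i, :) - mu)';                  % centered observation, p-by-1
    K = K + d * d' / (N - 1);             % each term has rank exactly 1
end
fprintf('rank(K) = %d, min(N - 1, p) = %d\n', rank(K), min(N - 1, p));
% With N = p the bound is p - 1 (centering costs one rank), so K must be singular.
```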