Solved – Intuition as to why estimates of a covariance matrix are numerically unstable

covariance, estimation, linear algebra

It is well known that estimating the covariance matrix by the ML estimator (sample covariance) can be very numerically unstable (in high dimensions), which is why it is preferable to do PCA with the SVD, for example. But I haven't been able to find an intuitive explanation of why this is the case.

For matrix inversions it is clearer where the numerical instability arises, but for centred and standardized data the covariance matrix is just the seemingly innocuous product $XX'$.
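To make the comparison concrete, here is a minimal numpy sketch (dimensions and data are arbitrary) of the two routes: PCA via an eigen-decomposition of the sample covariance versus PCA via an SVD of the centred data. On well-behaved data like this they agree.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))          # N = 500 observations, p = 5 variables
X0 = X - X.mean(axis=0)                    # centre the columns
N = X0.shape[0]

# Route 1: eigen-decomposition of the sample covariance.
C = X0.T @ X0 / (N - 1)
eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order

# Route 2: SVD of the centred data, never forming C explicitly.
U, s, Vt = np.linalg.svd(X0, full_matrices=False)
svd_vals = s**2 / (N - 1)                  # same quantities as the eigenvalues of C

print(np.allclose(np.sort(eigvals)[::-1], svd_vals))   # True on well-conditioned data
```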

EDIT: here is a reference: Regularized estimation of large covariance matrices

Best Answer

The reason that the SVD of the original matrix $X$ is preferred over the eigen-decomposition of the covariance matrix $C$ when doing PCA is that the eigenvalue problem posed by the covariance matrix $C$ (where $C = \frac{1}{N-1}X_0^T X_0$, $X_0$ being the zero-centred version of the original matrix $X$) has a higher condition number than the corresponding problem posed by the original data matrix $X$. In short, the condition number of a matrix quantifies how sensitive the solution of a system of linear equations defined by that matrix is to errors in the original data. The condition number strongly suggests (but does not fully determine) the quality of the solution of that system of linear equations.
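To make this sensitivity concrete, here is a minimal numpy sketch (the ill-conditioned matrix is just an arbitrary example): a tiny relative perturbation of the right-hand side of $Ax=b$ can move the solution by roughly the condition number times as much.

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately ill-conditioned 2x2 system A x = b.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-8]])
x_true = np.array([1.0, 1.0])
b = A @ x_true

print("cond(A) =", np.linalg.cond(A))      # on the order of 1e8

# Perturb the right-hand side by a tiny relative amount ...
b_noisy = b * (1 + 1e-10 * rng.standard_normal(2))

# ... and observe how much the solution moves.
x_noisy = np.linalg.solve(A, b_noisy)
rel_err_b = np.linalg.norm(b_noisy - b) / np.linalg.norm(b)
rel_err_x = np.linalg.norm(x_noisy - x_true) / np.linalg.norm(x_true)
print("relative error in b:", rel_err_b)   # around 1e-10
print("relative error in x:", rel_err_x)   # can be roughly cond(A) times larger
```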

In particular, since the covariance matrix $C$ is computed from the cross-product of $X_0$ with itself, the ratio of the largest to the smallest singular value of $X_0$ is squared. That ratio is the condition number; values close to unity, or generally below a few hundred, suggest a rather stable system. This is easy to see as follows:

Assume that $X_0 = USV^T$, where $U$ holds the left singular vectors, $V$ holds the right singular vectors and $S$ is the diagonal matrix holding the singular values of $X_0$. Since $C = \frac{1}{N-1}X_0^TX_0$, we can write: $C = \frac{1}{N-1} VS^TU^T USV^T = \frac{1}{N-1} V S^T S V^T = \frac{1}{N-1} V \Sigma V^T$. (Remember that the matrix $U$ is orthonormal, so $U^TU = I$.) That is, the singular values of $X_0^TX_0$, represented in $\Sigma$, are the squares of the singular values of $X_0$, represented in $S$.
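A quick numerical check of this identity (with made-up data): the eigenvalues of $C$ are the squared singular values of $X_0$ divided by $N-1$, so the condition number of $C$ is the square of that of $X_0$.

```python
import numpy as np

rng = np.random.default_rng(1)
X0 = rng.standard_normal((200, 10))
X0 -= X0.mean(axis=0)
N = X0.shape[0]

C = X0.T @ X0 / (N - 1)

s = np.linalg.svd(X0, compute_uv=False)          # singular values of X0
eig = np.sort(np.linalg.eigvalsh(C))[::-1]       # eigenvalues of C, descending

print(np.allclose(eig, s**2 / (N - 1)))          # True: eigenvalues are squared singular values
print(np.linalg.cond(C), np.linalg.cond(X0)**2)  # the two condition numbers agree
```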

As you can see, while seemingly innocuous, the cross-product $X_0^TX_0$ squares the condition number of the system you are trying to solve, and thus makes the resulting system of equations (more) prone to numerical instability issues.
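The practical consequence is easiest to see with a Läuchli-type example (a sketch; $\epsilon$ is chosen near the square root of double-precision machine epsilon): the small singular value of $X$ is resolved by the SVD of $X$ itself, but is wiped out once the cross-product is formed in floating point.

```python
import numpy as np

eps = 1e-8                                   # roughly sqrt of double-precision machine epsilon
X = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

# Singular values straight from X: the small one (~eps) is resolved correctly.
print(np.linalg.svd(X, compute_uv=False))    # approx [1.414, 1e-8]

# Forming the cross-product squares the singular values; eps**2 = 1e-16
# is rounded away against 1.0 in double precision, so the matrix becomes singular.
G = X.T @ X                                  # exactly [[1+eps^2, 1], [1, 1+eps^2]]
print(np.linalg.eigvalsh(G))                 # approx [0.0, 2.0] -- the small eigenvalue is lost
```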

Some additional clarification, particular to the paper linked: the estimate of the covariance matrix $C$ is immediately rank-degenerate in the cases where $N < p$, which are the main focus of that paper; that is why the authors initially draw attention to the Marcenko–Pastur law (about the distribution of singular values) and to regularisation and banding techniques. Without such notions, working with $C$ or with the inverse of $C$ (in the form of the Cholesky factor of the inverse of $C$, as the authors do) is numerically unstable. The rationale for why these covariance matrices are degenerate is exactly the same as above in the case of very large matrices: the condition number is squared. This is even more prominent in the $N < p$ case: an $N\times p$ matrix $X$ has at most $N$ non-zero singular values, so its cross-product with itself can also have at most $N$ non-zero singular values, leading to rank degeneracy (and therefore an "infinite" condition number). The paper presents a way to band the estimated covariance matrix, given some particular conditions (the estimated $C$ has a Toeplitz structure, the oracle $k$ representing the banding parameter can be properly estimated, etc.), such that it is numerically stable.
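Finally, a small sketch of the rank-degenerate $N < p$ regime (dimensions chosen arbitrarily): a $p \times p$ covariance estimate built from only $N$ observations has rank at most $N$ (one less after centring), so its condition number blows up.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 20, 100                                # fewer observations than variables
X = rng.standard_normal((N, p))
X0 = X - X.mean(axis=0)

C = X0.T @ X0 / (N - 1)                       # p x p, but built from only N observations

print(np.linalg.matrix_rank(C))               # at most N (here N - 1 = 19, centring costs one)
print(np.linalg.cond(C))                      # astronomically large: C is numerically singular
```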