Solved – What to do when sample covariance matrix is not invertible

clustering, covariance, covariance-matrix, matrix inverse, multivariate analysis

I am working on some clustering techniques, where for a given cluster of $d$-dimensional vectors I assume a multivariate normal distribution and calculate the sample $d$-dimensional mean vector and the sample covariance matrix.

Then, when trying to decide whether a new, unseen $d$-dimensional vector belongs to this cluster, I check its distance using this measure:
$$\left(X_i-\hat{\mu}_X\right)'\hat{\sigma}_X^{-1}\left(X_i-\hat{\mu}_X\right)>B_{0.95}\left(\frac{p}{2},\frac{n-p-1}{2}\right)$$

This requires me to compute the inverse of the covariance matrix $\hat{\sigma}_X$. But given a finite sample, I cannot guarantee that the covariance matrix will be invertible. What should I do when it is not?
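For concreteness, a minimal sketch of the check I have in mind (assuming NumPy and SciPy; the data, the test vector `x_new`, and the Beta quantile parameters are only illustrative and simply follow the inequality above):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

# Cluster data: n samples of dimension p.
n, p = 50, 3
X = rng.normal(size=(n, p))

# Sample mean vector and sample covariance matrix.
mu_hat = X.mean(axis=0)
sigma_hat = np.cov(X, rowvar=False)

# New, unseen vector to test against the cluster.
x_new = rng.normal(size=p)

# Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu).
diff = x_new - mu_hat
d2 = diff @ np.linalg.inv(sigma_hat) @ diff

# 0.95 quantile of the Beta distribution from the inequality above.
threshold = beta.ppf(0.95, p / 2, (n - p - 1) / 2)

print(f"squared distance = {d2:.3f}, threshold = {threshold:.3f}, "
      f"exceeds threshold: {d2 > threshold}")
```

The call to `np.linalg.inv` is exactly the step that breaks down when the sample covariance matrix is singular.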

Thanks

Best Answer

If the dimensionality spanned by your samples is less than the dimensionality of the vector space, singular matrices can arise. If you have fewer than $d+1$ samples (where $d$ is your dimensionality), this will necessarily happen: $k+1$ samples span at most a $k$-dimensional hyperplane. Given such a small sample, you obviously cannot estimate any variance in the orthogonal space.

This is why it's common not to use literal PCA, but to perform a singular value decomposition (SVD) instead, which can be used to compute the pseudoinverse of a matrix. If the matrix is invertible, the pseudoinverse equals the inverse.
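As a rough sketch of this approach (assuming NumPy; `np.linalg.pinv` computes the Moore-Penrose pseudoinverse via an SVD, and the helper `pseudo_mahalanobis_sq` is just an illustrative name):

```python
import numpy as np

def pseudo_mahalanobis_sq(x, mu, cov, rcond=1e-10):
    """Squared Mahalanobis-style distance using the Moore-Penrose pseudoinverse."""
    # np.linalg.pinv runs an SVD internally and inverts only the singular values
    # above the rcond cutoff; for an invertible matrix it equals the inverse.
    return (x - mu) @ np.linalg.pinv(cov, rcond=rcond) @ (x - mu)

# Four samples in three dimensions, all lying in the z = 0 plane,
# so the sample covariance matrix is singular.
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 0.0],
              [3.0, 1.0, 0.0]])
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

print(np.linalg.matrix_rank(cov))                                  # 2, not 3
print(pseudo_mahalanobis_sq(np.array([1.5, 0.5, 5.0]), mu, cov))   # 0.0
```

Note that the test point sits 5 units away from the plane the cluster lies in, yet its distance comes out as 0: the pseudoinverse simply ignores the direction in which the data has no variance, which is exactly the problem described next.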

However, if you are seeing non-invertible matrices, chances are that the distance from the cluster will be meaningless whenever the vector lies outside the hyperplane the cluster represents, because you do not know the variance in the orthogonal space (you can think of this variance as 0!). SVD can compute the pseudoinverse, but those "variances" will still not be determined by your data.

In this case, you should probably have performed global dimensionality reduction first. Increasing the sample size only helps when you actually have non-redundant dimensions: no matter how many samples you draw from a distribution with $y=x$, the covariance matrix will always be non-invertible, and you will not be able to judge the deviation $x-y$ relative to its standard deviation (which is 0).
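A quick numerical illustration of the $y=x$ example (assuming NumPy): the covariance matrix stays rank-deficient no matter how many samples are drawn.

```python
import numpy as np

rng = np.random.default_rng(0)

# One dimension is fully redundant (y = x exactly), so the covariance
# matrix stays singular regardless of the sample size.
for n in (10, 1_000, 100_000):
    x = rng.normal(size=n)
    data = np.column_stack([x, x])   # y = x
    cov = np.cov(data, rowvar=False)
    # Rank stays 1 and the determinant is (essentially) 0 every time.
    print(n, np.linalg.matrix_rank(cov), np.linalg.det(cov))
```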

Furthermore, depending on how you compute the covariance matrix, you might run into numerical issues due to catastrophic cancellation. The simplest workaround is to always center the data first, so that it has zero mean.
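For instance (assuming NumPy; the uncentered one-pass formula $E[X^2]-E[X]^2$ stands in here for whatever computation might be causing the cancellation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data with a large mean and a small spread: the worst case for cancellation.
x = 1e8 + rng.normal(scale=1.0, size=100_000)

# Uncentered one-pass formula E[x^2] - E[x]^2: subtracts two nearly equal
# huge numbers, so most of the significant digits cancel.
naive_var = np.mean(x**2) - np.mean(x)**2

# Center first, then average the squares: numerically stable.
centered = x - x.mean()
stable_var = np.mean(centered**2)

print("uncentered:", naive_var)   # can be far off, even negative
print("centered:  ", stable_var)  # close to the true variance of 1
```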
