PCA – Calculating Mahalanobis Distance via PCA When n < p

correlation, covariance, distance-functions, genetics, pca

I have an $n\times p$ matrix, where $p$ is the number of genes and $n$ is the number of patients. Anyone who has worked with such data knows that $p$ is always larger than $n$. Using feature selection I have gotten $p$ down to a more reasonable number; however, $p$ is still greater than $n$.

I would like to compute the similarity of the patients based on their genetic profiles. I could use the Euclidean distance, but the Mahalanobis distance seems more appropriate because it accounts for the correlation among the variables. The problem (as noted in this post) is that the Mahalanobis distance, specifically the inversion of the covariance matrix, does not work when $n < p$. When I run the Mahalanobis distance in R, the error I get is:

 Error in solve.default(cov, ...) :
   system is computationally singular: reciprocal condition number = 2.81408e-21
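
For reference, here is a minimal sketch of why the inversion fails. It is written in Python with NumPy rather than R, and it uses synthetic data; the sizes and variable names are assumptions for illustration, not taken from the actual data set. With $n$ samples, the sample covariance matrix has rank at most $n - 1$, so for $n < p$ it is singular and cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 10 patients (n) and 50 genes (p), so n < p.
n, p = 10, 50
X = rng.normal(size=(n, p))      # synthetic stand-in for expression data

# p x p sample covariance matrix (columns are variables).
S = np.cov(X, rowvar=False)

# Centering removes one degree of freedom, so the rank is at most n - 1,
# far below p: the matrix has no inverse.
rank = np.linalg.matrix_rank(S)
print("covariance shape:", S.shape)
print("rank:", rank, "but full rank would be", p)
```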

So far, to try to solve this, I have used PCA: instead of using genes, I use the principal components, and this seems to allow me to compute the Mahalanobis distance. Five components represent about 80% of the variance, so now $n > p$ (with $p$ counting components rather than genes).
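
The reduction step described above could look roughly like the following sketch. It is in Python with NumPy rather than R, and it runs on synthetic data; the data-generating model, the sizes, and all variable names are assumptions for illustration. Only the 80% threshold comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 20 patients, 100 genes, driven by 3 latent factors.
n, p = 20, 100
latent = rng.normal(size=(n, 3))
loadings = rng.normal(size=(3, p))
X = latent @ loadings + 0.5 * rng.normal(size=(n, p))

# PCA via SVD of the column-centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance carried by each component, and its running sum.
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 80%.
k = int(np.searchsorted(cumulative, 0.80) + 1)

# Component scores for the retained components: an n x k matrix with k < n.
scores = U[:, :k] * s[:k]
print("components kept:", k, "scores shape:", scores.shape)
```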

My questions are: Can I use PCA to meaningfully get the Mahalanobis distance between patients, or is it inappropriate? Are there alternative distance metrics that work when $n < p$ and there is also much correlation among the $p$ variables?

Best Answer

If you keep all the components from a PCA, then the Euclidean distances between patients in the new PCA space will equal their Mahalanobis distances in the observed-variable space. If you skip some components, the distances will change a little, but the idea still holds. Here I refer to unit-variance PCA components, not the kind whose variance is equal to the eigenvalue (I am not sure which kind your PCA implementation returns).
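
To spell out the reasoning (a sketch, assuming the sample covariance matrix $S$ is invertible; the symbols $S$, $V$, $\Lambda$ and $z_i$ are introduced here for illustration): write $S = V \Lambda V^\top$, with the principal axes as the columns of $V$ and the eigenvalues on the diagonal of $\Lambda$. The unit-variance component scores of patient $i$ are $z_i = \Lambda^{-1/2} V^\top (x_i - \bar{x})$, and then

$$d_M^2(x_i, x_j) = (x_i - x_j)^\top S^{-1} (x_i - x_j) = (x_i - x_j)^\top V \Lambda^{-1} V^\top (x_i - x_j) = \lVert z_i - z_j \rVert^2 .$$

If the scores are instead left with variance equal to the eigenvalues, the factor $\Lambda^{-1/2}$ is missing, the transformation is a pure rotation, and the Euclidean distance in PCA space simply reproduces the Euclidean distance in the original space.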

I just mean that if you want to evaluate the Mahalanobis distance between the patients, you can apply PCA and evaluate the Euclidean distance. Evaluating the Mahalanobis distance after applying PCA seems meaningless to me.
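
A quick numerical check of this equivalence, as a sketch in Python with NumPy rather than R. It uses synthetic data with $n > p$ so that the covariance matrix is invertible and the direct Mahalanobis computation is available for comparison; the sizes and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes with n > p, so the covariance matrix can be inverted.
n, p = 30, 4
A = rng.normal(size=(p, p))
X = rng.normal(size=(n, p)) @ A      # synthetic correlated variables

S = np.cov(X, rowvar=False)
S_inv = np.linalg.inv(S)

# Direct pairwise Mahalanobis distances in the original variable space.
diff = X[:, None, :] - X[None, :, :]             # n x n x p differences
D_mahal = np.sqrt(np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff))

# PCA via eigendecomposition of S, keeping all p components.
eigvals, eigvecs = np.linalg.eigh(S)
scores = (X - X.mean(axis=0)) @ eigvecs          # variance = eigenvalue
Z = scores / np.sqrt(eigvals)                    # rescaled to unit variance

# Plain Euclidean distances between the unit-variance component scores.
D_eucl = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

print("max abs difference:", np.max(np.abs(D_mahal - D_eucl)))
```

The rescaling line is the important one: without dividing by the square roots of the eigenvalues, the Euclidean distances between the scores match the Euclidean distances between the original rows of `X`, not the Mahalanobis distances.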
