Solved – Is PCA appropriate when $n

pca, small-sample

This question is an extension of one I asked a few weeks back:
Minimum sample size for PCA or FA when the main goal is to estimate only few components?

To restate: I am interested in using PCA in situations where $n \le p$, and generally only in the first few PC axes, either for descriptive purposes or as "synthetic" variables that reduce several dimensions into one.

My question today revolves around a text "Numerical Ecology", 3rd edition by Legendre & Legendre. On page 450, they state:

"A full-rank dispersion matrix $\mathbf S$ [variance-covariance] cannot be estimated using a number of observations $n$ smaller than or equal to the number of descriptors $p$. When $n \le p$, since there are $n-1$ DF in total, the rank of the resulting $\mathbf S$ matrix of order $p$ is $(n-1)$. In such a case, the eigen-decomposition of $\mathbf S$ produces $(n-1)$ real and $p-(n-1)$ null eigenvalues. Positioning $n$ objects while respecting their distances requires $(n-1)$ dimensions only. A PCA where $n \le p$ produces $(n-1)$ eigenvalues larger than $0$ and the $(n-1)$ corresponding eigenvectors and principal components."

In other words, I believe they are implying that it is OK to use PCA on a dataset where $n \le p$ as long as you are only interested in using $(n-1)$ or fewer of the PCs (as I am).
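The rank statement in the quoted passage can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming simulated Gaussian data and hypothetical sizes `n = 6`, `p = 20` (neither comes from the book); the tolerance used to decide which eigenvalues count as zero is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 20  # hypothetical sizes with n <= p
X = rng.normal(size=(n, p))  # simulated data: rows are objects, columns are descriptors

# Sample variance-covariance matrix S of order p.
S = np.cov(X, rowvar=False)

# S is symmetric, so eigvalsh applies; reverse to get descending order.
eigvals = np.linalg.eigvalsh(S)[::-1]

# Treat eigenvalues below a relative tolerance as null (assumed threshold).
tol = 1e-10 * eigvals.max()
n_positive = int(np.sum(eigvals > tol))
rank_S = int(np.linalg.matrix_rank(S, tol=tol))

print("positive eigenvalues:", n_positive)
print("rank of S:", rank_S)
print("null eigenvalues:", p - n_positive)
```

For continuous random data like this, both counts should equal $n-1$, with the remaining $p-(n-1)$ eigenvalues zero up to floating-point error.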

I am interested in your opinion on this (both their claim and my interpretation), and would appreciate any additional literature that might corroborate the claim.

Best Answer

Yes, you surely can do that. I don’t know of applications in ecology, but you may be interested to know that this is widely used in genetics (epidemiology and population genetics), with $n \ll p$, typically $n = 1000$ or $5000$ individuals and $p = 500\,000$ genotypes.

To adjust analyses for population mixture, the first 10 or 50 PCs are used. The first two PCs already carry a lot of information, as shown in Novembre J (2008). Pay special attention to figure 1, where you see that the first two PCs obtained from genomic data roughly recover the spatial arrangement of populations within Europe.
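As an illustration of keeping only the first few PCs when $n \ll p$, here is a minimal sketch in Python with NumPy. It assumes simulated Gaussian data as a stand-in (not real genotypes, and not the procedure of the cited paper), and the sizes `n`, `p` and the number of retained components `k` are hypothetical. It uses a thin SVD of the centered data matrix, so the $p \times p$ covariance matrix is never formed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 50, 2000, 2  # hypothetical: few individuals, many variables, keep k PCs
X = rng.normal(size=(n, p))  # simulated stand-in, not real genotype data

# Center each column (variable).
Xc = X - X.mean(axis=0)

# Thin SVD: U is n x n, s has n entries, Vt is n x p.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the sample covariance matrix; the last one is ~0 because of centering.
eigvals = s**2 / (n - 1)

# Scores of the n individuals on the first k principal components (n x k).
scores = U[:, :k] * s[:k]

print(scores.shape)
print("share of variance in first k PCs:", eigvals[:k].sum() / eigvals.sum())
```

The columns of `scores` can then be used as covariates or plotted against each other. Only $n-1$ components carry any variance, so asking for more than that returns nothing useful.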