PCA Sample Size – Minimum Sample Size for PCA or Factor Analysis

factor-analysis pca sample-size

I have a dataset with $n$ observations and $p$ variables (dimensions). Generally $n$ is small ($n = 12$ to $16$), while $p$ may range from small ($p = 4$ to $10$) to much larger ($p = 30$ to $50$).

I remember learning that $n$ should be much larger than $p$ in order to run principal component analysis (PCA) or factor analysis (FA), but that may not hold for my data. Note that for my purposes I am rarely interested in any principal components past PC2.

Questions:

  1. What are the rules of thumb for minimum sample size when PCA is OK to use, and when it is not?
  2. Is it ever OK to use the first few PCs even if $n=p$ or $n<p$?
  3. Are there any references on this?
  4. Does it matter if your main goal is to use PC1 and possibly PC2 either:

    • simply graphically, or
    • as a synthetic variable that is then used in regression?

Best Answer

You can actually measure whether your sample size is "large enough". One symptom of a sample size that is too small is instability.

Bootstrap or cross-validate your PCA: these techniques perturb your data set by deleting (cross-validation) or exchanging (bootstrap) a small fraction of your sample, and then build a "surrogate model" for each perturbed data set. If the surrogate models are similar enough (= stable), you are fine. You'll probably need to take into account that the PCA solution is not unique: PCs can flip sign (multiplying both a score and the respective principal component by $-1$ gives an equivalent solution). You may also want to use Procrustes rotation to obtain PC models that are as similar as possible before comparing them.
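As a rough illustration of the bootstrap check described above, here is a minimal sketch in Python using only NumPy. The function names (`pca_axes`, `bootstrap_pc_stability`), the synthetic data, and the number of replicates are my own assumptions for illustration, not part of any library. Sign flips are handled by taking the absolute cosine similarity between each bootstrap PC and the corresponding full-data PC; this is a simpler stand-in for a full Procrustes rotation and does not handle PCs that swap order.

```python
import numpy as np

def pca_axes(X, k=2):
    """Return the first k principal axes (as rows) of X, via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]  # requires k <= min(n, p)

def bootstrap_pc_stability(X, k=2, n_boot=500, seed=0):
    """Absolute cosine similarity between full-data PCs and bootstrap PCs.

    Returns an array of shape (n_boot, k); values near 1 suggest a stable PC.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    ref = pca_axes(X, k)
    sims = np.empty((n_boot, k))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample observations with replacement
        boot = pca_axes(X[idx], k)
        # Both sets of axes are unit vectors, so the row-wise dot product is the
        # cosine; abs() makes the comparison insensitive to sign flips.
        sims[b] = np.abs(np.sum(ref * boot, axis=1))
    return sims

# Illustrative synthetic data only (n = 14, p = 6, one strong latent direction).
rng = np.random.default_rng(42)
latent = rng.normal(size=(14, 1))
X = 3.0 * latent @ np.ones((1, 6)) + 0.3 * rng.normal(size=(14, 6))

sims = bootstrap_pc_stability(X, k=2, n_boot=500)
print("median |cos| per PC:", np.median(sims, axis=0))
print("5th percentile per PC:", np.percentile(sims, 5, axis=0))
```

If the similarities for PC1 and PC2 stay close to 1 across replicates, those components are probably stable enough for plotting or for use as synthetic variables; a wide spread suggests the sample is too small for that component. What counts as "close enough" is a judgment call that depends on the application.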
