Selecting PCA Models Using AIC or BIC Methods

model-selection, pca

I want to use the Akaike Information Criterion (AIC) to choose the appropriate number of factors to extract in a PCA. The only issue is that I'm not sure how to determine the number of parameters.

Consider a $T\times N$ matrix $X$, where $N$ represents the number of variables and $T$ the number of observations, such that each row of $X$ is drawn from $\mathcal N\left(0,\Sigma\right)$. Since the covariance matrix is symmetric, a maximum likelihood estimate of $\Sigma$ would set the number of parameters in the AIC equal to $\frac{N\left(N+1\right)}{2}$.
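For reference, here is a minimal sketch of that unrestricted baseline (assuming zero-mean data, as above; `aic_full_covariance` is just a hypothetical helper name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def aic_full_covariance(X):
    """AIC for a zero-mean Gaussian with an unrestricted covariance.

    Hypothetical helper: the parameter count N(N+1)/2 is the number
    of free entries in a symmetric N x N covariance matrix.
    """
    T, N = X.shape
    sigma_hat = X.T @ X / T   # MLE of the covariance when the mean is fixed at 0
    log_lik = multivariate_normal(np.zeros(N), sigma_hat).logpdf(X).sum()
    k = N * (N + 1) // 2      # free parameters in a symmetric matrix
    return 2 * k - 2 * log_lik
```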

Alternatively, in a PCA, you could extract the first $f$ eigenvectors and eigenvalues of $\Sigma$, call them $\beta_{f}$ and $\Lambda_{f}$, and then model $$\Sigma=\beta_{f}\Lambda_{f}\beta_{f}'+I\sigma_{r}^{2},$$
where $\sigma_{r}^{2}$ is the average residual variance. By my count, if you have $f$ factors, then you would have $f$ parameters in $\Lambda_{f}$, $Nf$ parameters in $\beta_{f}$, and $1$ parameter in $\sigma_{r}^{2}$, for $Nf+f+1$ parameters in total.
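A minimal sketch of this construction (again assuming zero-mean data; `aic_pca` is a hypothetical helper that implements the covariance model exactly as written above, with the parameter count $Nf+f+1$):

```python
import numpy as np
from scipy.stats import multivariate_normal

def aic_pca(X, f):
    """AIC for the truncated-eigendecomposition model described above.

    Hypothetical helper: keep the top f eigenpairs of the sample
    covariance, replace the discarded variance with its average
    sigma_r^2, and count f + N*f + 1 parameters.
    """
    T, N = X.shape
    sigma_hat = X.T @ X / T                              # sample covariance (zero mean)
    eigvals, eigvecs = np.linalg.eigh(sigma_hat)         # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    beta_f = eigvecs[:, :f]                              # top-f eigenvectors
    lambda_f = np.diag(eigvals[:f])                      # top-f eigenvalues
    sigma_r2 = eigvals[f:].mean() if f < N else 0.0      # average residual variance
    sigma_model = beta_f @ lambda_f @ beta_f.T + sigma_r2 * np.eye(N)
    log_lik = multivariate_normal(np.zeros(N), sigma_model).logpdf(X).sum()
    k = f + N * f + 1                                    # the count in question
    return 2 * k - 2 * log_lik
```

Scanning $f=1,\dots,N$ and taking the minimizer would then give the selected number of factors.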

Is this approach correct? It seems like it would lead to more parameters than the maximum likelihood approach as the number of factors increases toward $N$.

Best Answer

The works of Minka (Automatic Choice of Dimensionality for PCA, 2000) and of Tipping & Bishop (Probabilistic Principal Component Analysis, 1999) on a probabilistic view of PCA might provide the framework you are interested in. Minka's work uses a Laplace approximation to approximate the log-evidence $\log p(D \mid k)$, where $k$ is the latent dimensionality of your dataset $D$; as he states explicitly: "A simplification of Laplace's method is the BIC approximation."
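As a practical aside, Minka's approximation is what scikit-learn's PCA uses when you ask it to pick the dimensionality itself; a minimal sketch on synthetic data (shapes chosen only for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 3 latent factors embedded in 10 observed variables, plus noise.
T, N, true_k = 500, 10, 3
X = rng.standard_normal((T, true_k)) @ rng.standard_normal((true_k, N)) \
    + 0.1 * rng.standard_normal((T, N))

# n_components='mle' invokes Minka's approximation (it requires the full
# SVD solver and T >= N); PCA then chooses the dimensionality automatically.
pca = PCA(n_components='mle', svd_solver='full').fit(X)
print(pca.n_components_)  # estimated latent dimensionality
```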

Note that this takes a Bayesian view of your problem, rather than the information-theoretic criterion (KL divergence) underlying AIC.

Regarding the original question of determining the number of parameters, I also think @whuber's comment carries the correct intuition.
