Solved – How to select the optimal number of principal components in functional principal components analysis

functional-data-analysis, multivariate-analysis, r

I am interested in selecting the optimal number of principal components in functional principal component analysis (FPCA). There are various techniques for doing this, for example AIC and BIC (I am not very familiar with them). One approach is to choose the number so that 84% of the variance is explained. This can be done in the R package fda using pca.fd(fdobj, nharm = .......)
Here the smallest value of nharm is selected such that the cumulative proportion of variance, taken from varprop, is greater than or equal to 0.84. I know little about the other methods. I have also gone through this link, but it does not serve my objective.
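
The cumulative-variance rule described here can be sketched in a few lines of R. The helper name smallest_nharm, the example proportions and the simulated curves are illustrative assumptions and not part of fda; only the 0.84 threshold comes from the question. The fda portion runs only if the package is installed.

```r
# Smallest number of components whose cumulative proportion of variance
# reaches `threshold`. Returns NA if the threshold is never reached.
smallest_nharm <- function(varprop, threshold = 0.84) {
  reached <- which(cumsum(varprop) >= threshold)
  if (length(reached) == 0) NA_integer_ else reached[1]
}

# Hypothetical proportions of variance, for illustration only
varprop_demo <- c(0.55, 0.20, 0.12, 0.08, 0.05)
k_demo <- smallest_nharm(varprop_demo)  # cumulative sums: 0.55, 0.75, 0.87, ...

# The same rule applied to pca.fd output, on simulated curves
if (requireNamespace("fda", quietly = TRUE)) {
  set.seed(1)
  tt <- seq(0, 1, length.out = 50)
  # 30 noisy curves built from two smooth modes of variation (50 x 30 matrix)
  y <- sapply(1:30, function(i) {
    rnorm(1) * sin(2 * pi * tt) + 0.5 * rnorm(1) * cos(2 * pi * tt) +
      rnorm(length(tt), sd = 0.1)
  })
  basis <- fda::create.bspline.basis(rangeval = c(0, 1), nbasis = 10)
  fdobj <- fda::smooth.basis(argvals = tt, y = y, fdParobj = basis)$fd
  # Fit with a generous nharm first, then read the answer off varprop
  fit <- fda::pca.fd(fdobj, nharm = 6)
  k <- smallest_nharm(fit$varprop)
  print(k)
}
```

Fitting once with a generous nharm and reading the answer off varprop avoids refitting for each candidate value.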

Your suggestions would be very helpful. I would be happy to use R for this problem.

Best Answer

Neither PCA nor FDA is designed to answer that question. Both transform the full data set into another set of the same dimension.

Intuitively, we imagine that the data depend on a small number of vectors, and that the rest of the variation in the sample is noise. However, if you attempt to formulate this intuition and solve it, what you get is factor analysis, not PCA.

Therefore, using PCA to reduce the dimension of the problem always relies on rules of thumb and ad hoc thinking. Personally, I would look at the proportion of total variance explained. I would also look at the coefficients to see whether they have an obvious and meaningful interpretation, and stop when the eigenvectors stop making sense.

There is a useful function in the psych package, fa.parallel, that uses a graphical method to determine the number of components for PCA and FA. Again, it's a rule of thumb, but it seems to produce sensible results most of the time.
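
As a rough illustration of the idea behind parallel analysis, here is a hand-rolled base R sketch. It is not the psych implementation: the function name parallel_pca, the simulated data, the number of iterations and the 95% quantile are all assumptions made for the example. The call to fa.parallel at the end runs only if psych is installed.

```r
# Parallel analysis for PCA: keep the leading components whose eigenvalues
# exceed the chosen quantile of eigenvalues from uncorrelated normal data
# with the same number of rows and columns.
parallel_pca <- function(x, n_iter = 100, quant = 0.95) {
  n <- nrow(x)
  p <- ncol(x)
  observed <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values
  simulated <- replicate(n_iter, {
    noise <- matrix(rnorm(n * p), n, p)
    eigen(cor(noise), symmetric = TRUE, only.values = TRUE)$values
  })
  # One threshold per component position (rows of `simulated`)
  threshold <- apply(simulated, 1, quantile, probs = quant)
  # Count leading components only: stop at the first one below its threshold
  n_comp <- sum(cumprod(observed > threshold))
  list(n_comp = n_comp, observed = observed, threshold = threshold)
}

# Hypothetical data: 8 variables driven by two latent factors plus noise
set.seed(42)
n <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- cbind(sapply(1:4, function(j) 0.8 * f1 + rnorm(n, sd = 0.6)),
           sapply(1:4, function(j) 0.8 * f2 + rnorm(n, sd = 0.6)))
res <- parallel_pca(x)
print(res$n_comp)

# The packaged version draws the scree plot and reports its suggestion
if (requireNamespace("psych", quietly = TRUE)) {
  pa <- psych::fa.parallel(x, fa = "pc")
  print(pa$ncomp)
}
```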

I would expect the number of components selected for PCA to be the same as, or similar to, the number selected for FDA. FDA is sort of like working with an oblique transformation of the basis of the data space, which shouldn't affect the underlying dimensionality of the problem.

Related Solutions

Solved – time series decomposition/detrending using splines

Could you use constrained B-splines from the R package cobs?

library(cobs)
co <- cobs(x, y, lambda = -1)  # negative lambda: penalty chosen by an information criterion

Principal Components Analysis vs Correspondence Analysis – A Comparative Guide

PCA works on the values, whereas CA works on the relative values. Both are fine for relative abundance data of the sort you mention (with one major caveat, see later). With % data you already have a relative measure, but there will still be differences. Ask yourself:

do you want to emphasise the pattern in the abundant species/taxa (i.e. the ones with large %cover), or
do you want to focus on the patterns of relative composition?

If the former, use PCA. If the latter, use CA. What I mean by the two questions is: would you want

A = {50, 20, 10}
B = { 5,  2,  1}

to be considered different or the same? A and B are two samples and the values are the %cover of the three taxa shown. (This example turned out poorly, assume there is bare ground! ;-) PCA would consider these very different because of the Euclidean distance used, but CA would consider these two samples very similar because they have the same relative profile.
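
The A and B example can be checked numerically. This is a simplified sketch that compares distances directly and does not run PCA or CA; the helper name profile is an assumption. CA proper uses a chi-square (weighted) distance between profiles, which is also zero when two profiles are identical.

```r
A <- c(50, 20, 10)  # % cover of three taxa in sample A
B <- c( 5,  2,  1)  # % cover of the same taxa in sample B

# Euclidean distance on the raw values (the geometry PCA works in)
d_raw <- sqrt(sum((A - B)^2))

# Distance between relative profiles: each sample divided by its own total
profile <- function(v) v / sum(v)
d_profile <- sqrt(sum((profile(A) - profile(B))^2))

print(d_raw)      # about 49.3
print(d_profile)  # 0: both profiles are (0.625, 0.25, 0.125)
```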

The big caveat here is the closed compositional nature of the data. If you have a few groups (Sand, Silt, Clay, for example) that sum to 1 (100%), then neither approach is correct and you could move to a more appropriate analysis via Aitchison's log-ratio PCA, which was designed for closed compositional data. (IIRC to do this you need to log transform the data and centre by rows and columns.) There are other approaches too. If you use R, then one book that would be useful is Analyzing Compositional Data with R.
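
One common reading of the log-ratio recipe is the centred log-ratio (clr) transform followed by an ordinary PCA. The base R sketch below uses hypothetical Sand/Silt/Clay data generated only for illustration, and it assumes all parts are strictly positive (zeros would need separate handling before taking logs).

```r
set.seed(7)
# Hypothetical compositions: 20 samples of (Sand, Silt, Clay), rows sum to 1
raw <- matrix(runif(60, min = 1, max = 10), ncol = 3,
              dimnames = list(NULL, c("Sand", "Silt", "Clay")))
comp <- raw / rowSums(raw)

# Centred log-ratio: log transform, then subtract each row's mean log value
log_comp <- log(comp)
clr <- log_comp - rowMeans(log_comp)

# prcomp centres the columns, which completes the double centring
lr_pca <- prcomp(clr, center = TRUE, scale. = FALSE)
print(summary(lr_pca))
```

Because each clr row sums to zero, the last component carries essentially no variance, so three parts give at most two informative components.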
