Solved – How to transform test data in Functional Principal Component Analysis in R

functional-data-analysis pca r

I am doing a functional principal component analysis (FPCA) on time series data. I have run the FPCA on the training data and extracted the PCs. Next, I need to project the test data onto those PCs, but I cannot work out how to do this in R.

Here are my steps to deal with the time series data with fda package in R:

1. Construct the basis functions with create.bspline.basis.
2. Smooth the data on that basis with smooth.basis.
3. Run the FPCA on the training dataset with pca.fd.

Up to step 3 I have obtained the scores and varprop, but I have no idea how to transform the test dataset onto the same PCs as the training data.

Thanks for your help in advance.

Best Answer

Projecting new functional data using an existing FPCA is very similar to what we would do with standard PCA (for multivariate data). The main difference is that, because of the stochastic nature of our sampling procedure, we cannot use standard numerical integration to get the corresponding scores as we would in the case of PCA; we use a probabilistic approximation of them instead (PACE, see the reference below).
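For contrast, here is a minimal sketch of the numerical-integration score that works when a curve is observed densely and without noise: $\xi_k = \int (y(s) - \mu(s))\phi_k(s)\,ds$, approximated with the trapezoidal rule. The function names, the grid and the cosine eigenfunctions are illustrative assumptions, not output of any FPCA package.

```python
import numpy as np

def trapezoid(f, s):
    """Trapezoidal rule for samples f taken on the increasing grid s."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s)))

def integration_scores(y, mu, phi, s):
    """Score k is the integral of (y - mu) * phi_k over s; phi holds one eigenfunction per column."""
    centred = y - mu
    return np.array([trapezoid(centred * phi[:, k], s) for k in range(phi.shape[1])])

# Hypothetical dense, noise-free example on [0, 1].
s = np.linspace(0.0, 1.0, 201)
mu = np.sin(2 * np.pi * s)
# sqrt(2) * cos(k * pi * s) are orthonormal on [0, 1].
phi = np.column_stack([np.sqrt(2) * np.cos(np.pi * s),
                       np.sqrt(2) * np.cos(2 * np.pi * s)])
true_xi = np.array([1.5, -0.7])
y = mu + phi @ true_xi

xi_hat = integration_scores(y, mu, phi, s)
```

With only a handful of noisy readings per curve this integral can no longer be approximated reliably, which is the situation PACE is designed for.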

For the rest of the post I will refer to $\phi$ as the functional PCs, $\xi$ as the associated FPC scores, $\lambda$ as their associated eigenvalues, $\mu$ as the sample mean and $C$ as the sample covariance. I also assume we are dealing with irregularly spaced data across a continuum $s$, and I refer to the test data at hand as $y_{test}$. In short, the prediction for the trajectory $y_i(s)$ using the first $K$ eigenfunctions is: $\hat{y}_i^K(s) = \hat{\mu}(s) + \sum_{k=1}^{K} \hat{\xi}_{i,k}\hat{\phi}_k(s)$.
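The truncated prediction formula above can be sketched in a few lines of NumPy. The helper name `reconstruct`, the grid and the sine eigenfunctions are assumptions made for illustration only.

```python
import numpy as np

def reconstruct(mu, phi, xi, K):
    """Return mu(s) + sum over the first K components of xi_k * phi_k(s)."""
    return mu + phi[:, :K] @ xi[:K]

# Hypothetical mean, eigenfunctions (one per column) and scores on a common grid.
s = np.linspace(0.0, 1.0, 101)
mu = s ** 2
phi = np.column_stack([np.sqrt(2) * np.sin(np.pi * s),
                       np.sqrt(2) * np.sin(2 * np.pi * s),
                       np.sqrt(2) * np.sin(3 * np.pi * s)])
xi = np.array([2.0, -1.0, 0.25])

y_full = reconstruct(mu, phi, xi, K=3)
# Largest pointwise gap between the K-term prediction and the 3-term curve.
errors = [float(np.max(np.abs(y_full - reconstruct(mu, phi, xi, K))))
          for K in (1, 2, 3)]
```

In this toy setting the gap shrinks as more eigenfunctions are included and vanishes at $K = 3$.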

In order to project new test data on the results of an existing FPCA we would require the following steps:

1. Ensure that $\mu$, $C$ and $\phi$ are evaluated at the same points of $s$ at which we have $y_{test}$ readings. If necessary, we estimate these values through interpolation.
2. Centre the test data by subtracting the $\hat\mu(s)$ we calculated during the original FPCA, so that the centred curves have mean zero.
3. Predict the $\xi$ for the test data, using the fact that we expect the error of the prediction to be asymptotically Gaussian, through: $\hat{\xi}_{ik} = \hat{\lambda}_k \hat{\phi}_{ik}^T\hat{\Sigma}^{-1}_{y_i}(y_i^{obs} - \hat{\mu}_i)$. Notice that all estimates (aside from $\hat{\lambda}_k$) are evaluated at the points where we have observations from the $i$-th curve, i.e. they might even be just scalars in the odd case that a particular sample has a single measurement. This whole procedure is what the FDA literature refers to as the "PACE step/procedure" (PACE: Principal components Analysis through Conditional Expectation); the canonical reference on the matter is: Yao, et al. (2005) Functional Data Analysis for Sparse Longitudinal Data (Sect. 2.3 to be exact).
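The three steps can be sketched in NumPy as below. This is an illustration under stated assumptions, not the fdapace or fda implementation: `pace_scores` is a hypothetical helper, linear interpolation stands in for whatever smoother is used, $\hat{\Sigma}_{y_i}$ is rebuilt from the retained eigenpairs rather than interpolated from $C$ directly, and `sigma2` (the measurement-error variance added to its diagonal) is a quantity the text above does not name.

```python
import numpy as np

def pace_scores(s_obs, y_obs, s_grid, mu_grid, phi_grid, lam, sigma2):
    """Conditional-expectation scores for one curve observed at the points s_obs."""
    # Step 1: evaluate mu and each eigenfunction at the observation points.
    mu_i = np.interp(s_obs, s_grid, mu_grid)
    phi_i = np.column_stack([np.interp(s_obs, s_grid, phi_grid[:, k])
                             for k in range(phi_grid.shape[1])])
    # Step 2: centre the observations with the training mean.
    centred = y_obs - mu_i
    # Step 3: Sigma_{y_i} = Phi diag(lam) Phi^T + sigma2 * I (assumed form),
    # then xi_k = lam_k * phi_ik^T Sigma^{-1} (y_i - mu_i).
    sigma_y = (phi_i * lam) @ phi_i.T + sigma2 * np.eye(len(s_obs))
    return lam * (phi_i.T @ np.linalg.solve(sigma_y, centred))

# Hypothetical training-side estimates on a fine grid.
s_grid = np.linspace(0.0, 1.0, 501)
mu_grid = np.sin(2 * np.pi * s_grid)
phi_grid = np.column_stack([np.sqrt(2) * np.cos(np.pi * s_grid),
                            np.sqrt(2) * np.cos(2 * np.pi * s_grid)])
lam = np.array([4.0, 1.0])

# One irregularly sampled test curve with known scores and almost no noise.
s_obs = np.array([0.05, 0.20, 0.33, 0.50, 0.61, 0.80, 0.97])
true_xi = np.array([1.5, -0.7])
y_obs = (np.sin(2 * np.pi * s_obs)
         + true_xi[0] * np.sqrt(2) * np.cos(np.pi * s_obs)
         + true_xi[1] * np.sqrt(2) * np.cos(2 * np.pi * s_obs))

xi_hat = pace_scores(s_obs, y_obs, s_grid, mu_grid, phi_grid, lam, sigma2=1e-6)
# A curve with a single measurement still yields one score per component.
xi_single = pace_scores(s_obs[:1], y_obs[:1], s_grid, mu_grid, phi_grid, lam, sigma2=0.1)
```

As `sigma2` grows, the estimated scores are shrunk towards zero, which is the expected behaviour of a conditional expectation under noise.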

The package fdapace implements this methodology through the function predict.FPCA. The package fda (most probably) implements this methodology in the function project.basis but I have not used it.

Related Solutions

Solved – How to transform test set to the PCA space of the training set, if the features in train and test are different

What I did was discard all the words that are in the test set but not in the train set, and rearrange everything so that the column order is the same in each matrix.

This can be seen in the following Python code. x_train is a pandas DataFrame which contains the training text, and x_test is a pandas DataFrame which contains the test text.

import numpy as np
import pandas as pd
import textmining as txtm  # assumed: the "textmining" package, under the alias used here
from sklearn.feature_extraction.text import TfidfTransformer

def build_tdm(docs, cutoff=3):
    """Build a term-document DataFrame; cutoff=3 keeps words that appear in 3 or more documents."""
    tdm = txtm.TermDocumentMatrix()
    for doc in docs:
        tdm.add_doc(doc)
    # Push the TDM data to a list of lists: the first row holds the terms, the rest hold the counts.
    tdm_rows = list(tdm.rows(cutoff=cutoff))
    return pd.DataFrame(np.array(tdm_rows[1:]), columns=tdm_rows[0])

# Train: sort the columns alphabetically
TDM_df_train = build_tdm(x_train)
TDM_df_train = TDM_df_train.reindex(columns=sorted(TDM_df_train.columns))

# Test: drop words that are not in TDM_df_train, add the missing ones as zero columns,
# and put the columns in the same order as the training matrix
TDM_df_test = build_tdm(x_test)
TDM_df_test = TDM_df_test.reindex(columns=TDM_df_train.columns, fill_value=0)

# Fit the TF-IDF weights on the training matrix only, then apply them to the test matrix
tfidf = TfidfTransformer()
tfidfRedTrain = tfidf.fit_transform(TDM_df_train.values)
tfidfRedTest = tfidf.transform(TDM_df_test.values)

The last lines just turn the TDMs into TF-IDF matrices.

This works perfectly well with small data sets, but now I have a new problem: can this be done more efficiently? With another dataset of 2000 documents it does not work; it just keeps running and never finishes. Does anybody know a way to do this in Scala, Spark, Hadoop or something else that is faster, or can anyone recommend where to do it?
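One way to avoid building two dense matrices and reconciling their columns afterwards is to fix the vocabulary on the training documents once and count the test documents against it in a single pass. The sketch below uses only the standard library; the function names, whitespace tokenisation and the smoothed IDF formula are illustrative assumptions, and no row normalisation is applied.

```python
import math
from collections import Counter

def fit_vocabulary(train_docs, min_docs=1):
    """Return the sorted training vocabulary and a smoothed IDF weight per term."""
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))
    vocab = sorted(t for t, n in doc_freq.items() if n >= min_docs)
    n_docs = len(train_docs)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf

def transform(docs, vocab, idf):
    """Weight term counts by the training IDF; words outside vocab are ignored."""
    index = {t: j for j, t in enumerate(vocab)}
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc.lower().split() if t in index)
        row = [0.0] * len(vocab)
        for term, count in counts.items():
            row[index[term]] = count * idf[term]
        rows.append(row)
    return rows

# Hypothetical toy corpora.
train_docs = ["the cat sat", "the dog sat", "a cat ran"]
test_docs = ["the bird sat", "cat cat zebra"]

vocab, idf = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocab, idf)
test_rows = transform(test_docs, vocab, idf)
```

Because the test rows are built directly in the training column order, words unseen in training ("bird", "zebra") never create columns, and the IDF weights come from the training set alone. For large corpora the dense row lists would normally be replaced by a sparse representation.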

Solved – Off-diagonal elements of a correlation matrix after removing the first principal component

This is to be expected.

Your correlation matrix has mostly large positive elements (around 0.4 on average), as shown in your own histogram. In other words, all variables are correlated with each other and tend to vary together. This suggests that the correlation matrix has one large eigenvalue, far surpassing the rest, corresponding to a strong first principal component that captures this "overall" variation of the data. After this first PC is removed from the data, the remaining correlation matrix can be expected to have off-diagonal elements much closer to zero.

Indeed, here is what the eigenvalue spectrum looks like in your case:

And here is how your correlation matrix looks (color scale from -1 to 1). On the left is the original matrix; in the middle is its rank-one approximation given by the first PC (i.e. the correlations as reconstructed by PC1); on the right is the residual correlations after the first PC is removed. And I show the corresponding histograms of the off-diagonal elements below.

As you see, the matrix has mostly positive elements (looking uniformly "orange") and can be well approximated by its first eigenvector only (middle). The residual is around zero.

Note that the mean of the off-diagonal elements after the first PC is removed is not exactly zero (-0.012 in this case). So it is not true that removing the first PC will achieve exact centering of the off-diagonal elements. But in cases like yours one can certainly expect it to happen. (It would be interesting to try to construct an example where it would not be the case; I don't have a ready solution to that.)
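The effect can be reproduced on an idealised matrix in which every pair of variables has the same correlation. The sketch below is an assumption-laden toy (20 variables, all correlations equal to 0.4), not the data from the question; it subtracts the rank-one approximation given by the first eigenpair and compares the off-diagonal means.

```python
import numpy as np

p, rho = 20, 0.4  # hypothetical number of variables and common correlation

# Equicorrelation matrix: ones on the diagonal, rho everywhere else.
R = np.full((p, p), rho)
np.fill_diagonal(R, 1.0)

# eigh returns eigenvalues in ascending order, so the last pair is the first PC.
eigvals, eigvecs = np.linalg.eigh(R)
lam1, v1 = eigvals[-1], eigvecs[:, -1]

rank_one = lam1 * np.outer(v1, v1)  # correlations as reconstructed by PC1
residual = R - rank_one             # what is left after removing PC1

off_diag = ~np.eye(p, dtype=bool)
mean_before = R[off_diag].mean()
mean_after = residual[off_diag].mean()
```

For this matrix the leading eigenvalue is $1 + (p-1)\rho = 8.6$ and every residual off-diagonal element equals $(\rho - 1)/p = -0.03$: close to zero and slightly negative, but not exactly zero.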
