Solved – Principal components analysis on nested data

Tags: multilevel-analysis, nested-data, pca, r

I'm working on a piece of analysis that requires identifying a small set of variables that summarize the variation found in a larger set of observations on teacher practice. Given the nature of the question, it seems natural to use PCA in my analysis. My data set, however, has a hierarchical structure (observation occasions at the first level, and teachers at the second). So, I have the following questions:

  1. What do I risk by doing PCA without taking this nested, repeated-measures structure into account?
  2. What methods are there for accounting for the nested structure of data in PCA? Are there any useful tools in R?

Best Answer

An important question is to evaluate how consistent the principal axes are across individual teachers. If they are mostly consistent, you can just run PCA on the whole dataset (ignoring the grouping by teacher) and be happy. If it turns out that different teachers have fundamentally different axes of variance, then you likely have to reconsider the entire approach.

So the question I will try to answer is how to test whether the principal axes are consistent across individuals. This is a solution I used some time ago; I make no guarantees that it is in any way optimal.

Step 1: Decide on the fraction of explained variance $\alpha$ (e.g. 90%) that your PCA should cover.

Step 2: For one teacher $i$, find the first $n$ principal components that together explain a fraction $\alpha$ of the variance. This gives you the eigenvalues $e_i$ and eigenvectors $V_i$. Importantly, the eigenvectors form the mapping that projects the original data $x_i$ onto the PCA space $\psi_{ii}$:

$$\psi_{ii} = V_i^T x_i$$
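Since the question asks about R, here is a minimal sketch of Steps 1–2 using base `prcomp`. The object names (`X_i` for one teacher's occasions-by-variables matrix, `pca_basis`, `psi_ii`) are illustrative, not part of the original procedure; data are assumed to be column-centred per teacher.

    # Return the loading matrix V_i whose components explain at least alpha of the variance
    pca_basis <- function(X_i, alpha = 0.9) {
      pc   <- prcomp(X_i, center = TRUE, scale. = FALSE)
      expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)   # cumulative fraction of variance explained
      n    <- which(expl >= alpha)[1]              # smallest n reaching alpha
      pc$rotation[, 1:n, drop = FALSE]             # V_i: p x n matrix of eigenvectors (loadings)
    }

    V_i    <- pca_basis(X_i, alpha = 0.9)
    # psi_ii = V_i^T x_i; with observations stored as rows this is X_i %*% V_i
    psi_ii <- scale(X_i, center = TRUE, scale = FALSE) %*% V_i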

Step 3: What we want to do is calculate, for another teacher $j$, how much of their variance would be explained by using the PCA space of teacher $i$. The reconstruction of the original data from its PCA projection is

$$\hat{x}_{ii} = V_i\psi_{ii} = V_iV_i^Tx_i$$

and the relative reconstruction error (equivalently, the fraction of unexplained variance) is then

$$l^2_{ii} = \frac{||\hat{x}_{ii} - x_i||^2}{||x_i||^2} $$

where $||\cdot||$ is the Euclidean (2-)norm. The same principle can be used to project the data of teacher $j$ onto the basis of teacher $i$:

$$\hat{x}_{ij} = V_i \psi_{ij} = V_i V_i^T x_j$$

with the corresponding relative reconstruction error

$$l^2_{ij} = \frac{||\hat{x}_{ij} - x_j||^2}{||x_j||^2} $$
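A sketch of Step 3 in R, continuing the objects from the block above. `X_j` stands for teacher $j$'s occasions-by-variables matrix; how to centre it (its own mean vs. a pooled mean) is a modelling choice the procedure leaves open, and its own mean is used here.

    # Fraction of teacher j's variance NOT captured by teacher i's basis V_i
    recon_error <- function(X_j, V_i) {
      Xc   <- scale(X_j, center = TRUE, scale = FALSE)  # centre teacher j's data
      Xhat <- Xc %*% V_i %*% t(V_i)                     # x_hat_ij = V_i V_i^T x_j, row-wise
      sum((Xhat - Xc)^2) / sum(Xc^2)                    # l^2_ij
    }

    l2_ii <- recon_error(X_i, V_i)   # at most 1 - alpha by construction
    l2_ij <- recon_error(X_j, V_i)   # larger values mean i's axes explain j's data poorly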

Step 4: Finally, we can repeat this procedure to obtain the matrix $l^2_{ij}$ for every pair of teachers $i, j$; small values mean that teacher $i$'s axes capture teacher $j$'s variance well. This matrix can be used to judge the extent to which a common PCA would capture individual variance, to identify clusters of teachers that share similar variance axes, and to spot outliers.
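And a sketch of Step 4, reusing the hypothetical `pca_basis` and `recon_error` helpers above. `teacher_data` is assumed to be a list of per-teacher matrices with the same variables in the same column order.

    pairwise_l2 <- function(teacher_data, alpha = 0.9) {
      bases <- lapply(teacher_data, pca_basis, alpha = alpha)   # one basis per teacher
      k  <- length(teacher_data)
      l2 <- matrix(NA_real_, k, k,
                   dimnames = list(names(teacher_data), names(teacher_data)))
      for (i in seq_len(k)) {
        for (j in seq_len(k)) {
          # row i, column j: error of projecting teacher j's data onto teacher i's basis
          l2[i, j] <- recon_error(teacher_data[[j]], bases[[i]])
        }
      }
      l2
    }

Uniformly small entries suggest a common PCA is defensible; rows or columns that stand out flag teachers whose variance axes differ from the rest, and clustering a symmetrised version of the matrix (e.g. with `hclust(as.dist(...))`) can reveal groups of teachers with similar axes.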
