PCA – Principal Components Analysis Creating Correlated Axes with Nested Data

mixed modelpca

I'm trying to do a principal components analysis with the aim of turning my set of correlated variables into a set of uncorrelated ones (rather than dimension reduction). However, the data are nested, and when I account for this nesting some of my components are highly correlated.

Specifically, my original variables are measured at multiple sites within a number of forest patches. When I run a standard PCA none of the principal components are correlated across the dataset as a whole, but when I model one component as a function of another in a mixed effects model which includes patch as a random factor, the relationship between some pairs of components is highly significant.

Can someone please tell me if there is a way to do a PCA while accounting for nested data structure, or if not, if there is another approach I could use to convert a set of correlated variables into a set of uncorrelated ones while accounting for nesting?

Thank you for your help,

Jay

Best Answer

user11852: your suggestion of doing a phylogenetic PCA worked beautifully thank you!

Interesting that the REvell paper and corresponding R package discuss this technique only in terms of correcting for relatedness when comparing traits among species. Seems to be much more widely applicable than this - ie works for nested data in general.

Edit: giving an example of using the approach, as asked for in @nan's comment below:

First, an explanation of the data. My dataframe is called 'Rat', and it consists of rat capture rates and vegetation variables measured in two forest patches, 'B86Grazed' and 'LittleTutu'. Variables were measured at multiple sites within each patch, and my individual sites were named 'LittleTutu-1', 'LittleTutu-2' etc. My aim was to create PCA axes from the vegetation variables that were uncorrelated once the nesting of sites within patches was taken into account. For example, modelling axis1 as a function of axis2 while specifying patch identity as a random factor would find no significant effect of axis2. [Note also that I wouldnt be fitting a random-effect with only two patches; I've simplified the dataset for the example].

This is the code I used:

library(ape) #creating phylogenetic trees
library(phytools) #phylogenetic PCA

#define tree using Newick format, specifying that sites are nested within patches, and assigning all sites a branch length =1 in this example (indicating all sites within a patch are equally related, and equally unrelated to all sites from the other patch):
cat("((B86Grazed-1:1,B86Grazed-2:1,B86Grazed-3:1,B86Grazed-4:1,B86Grazed-5:1)B86Grazed:1,(LittleTutu-1:1,LittleTutu-2:1,LittleTutu-3:1,LittleTutu-4:1,LittleTutu-5:1,LittleTutu-6:1)LittleTutu:1)Patches;", file = "phyloPCAtree.tre", sep = "\n")
myTree <- read.tree(file="phyloPCAtree.tre")

#check tree structure: should show sites nested within patches
plot.phylo(myTree, show.node.label=T)

#create matrix of vegetation variables I want to reduce, and run phylogenetic PCA:
vegMatrix <- as.matrix(Rat[,c(16,17,19,21,30,31)])  #veg variables were #16,#17 etc in dataframe 
rownames(vegMatrix) <-Rat$Patch.trap #rownames need to match the names of the tips in myTree exactly ('LittleTutu-1' etc). Rat$Patch.trap is a vector of these names.

phyPCA <- phyl.pca(tree=myTree, Y=vegMatrix, mode="corr")   #based on correlation, rather than covariance, matrix
phyPCA      
phyPC1 <- phyPCA$S[,1]  #extract first PCA axis     

This approach produced axes that were correlated with my response variable (rat capture rates) but which were uncorrelated with each other when nesting of sites within patches was taken into account. Note that I haven't been able to make this approach work when the data contains 'singleton' patches (i.e. patches with only one site). Note also that the 'read.tree' command removes whitespace from the tree - so best not to have spaces in patch or site names.

Lastly, when I did this phylogenetic PCA it was for a specific application which required uncorrelated axes after nesting was accounted for. Whether this approach should be done any time a PCA is done on nested data is another question (which I'd be interested to hear people's thoughts on).

Related Question