user11852: your suggestion of doing a phylogenetic PCA worked beautifully thank you!
Interesting that the REvell paper and corresponding R package discuss this technique only in terms of correcting for relatedness when comparing traits among species. Seems to be much more widely applicable than this - ie works for nested data in general.
Edit: giving an example of using the approach, as asked for in @nan's comment below:
First, an explanation of the data. My dataframe is called 'Rat', and it consists of rat capture rates and vegetation variables measured in two forest patches, 'B86Grazed' and 'LittleTutu'. Variables were measured at multiple sites within each patch, and my individual sites were named 'LittleTutu-1', 'LittleTutu-2' etc. My aim was to create PCA axes from the vegetation variables that were uncorrelated once the nesting of sites within patches was taken into account. For example, modelling axis1 as a function of axis2 while specifying patch identity as a random factor would find no significant effect of axis2. [Note also that I wouldnt be fitting a random-effect with only two patches; I've simplified the dataset for the example].
This is the code I used:
library(ape) #creating phylogenetic trees
library(phytools) #phylogenetic PCA
#define tree using Newick format, specifying that sites are nested within patches, and assigning all sites a branch length =1 in this example (indicating all sites within a patch are equally related, and equally unrelated to all sites from the other patch):
cat("((B86Grazed-1:1,B86Grazed-2:1,B86Grazed-3:1,B86Grazed-4:1,B86Grazed-5:1)B86Grazed:1,(LittleTutu-1:1,LittleTutu-2:1,LittleTutu-3:1,LittleTutu-4:1,LittleTutu-5:1,LittleTutu-6:1)LittleTutu:1)Patches;", file = "phyloPCAtree.tre", sep = "\n")
myTree <- read.tree(file="phyloPCAtree.tre")
#check tree structure: should show sites nested within patches
plot.phylo(myTree, show.node.label=T)
#create matrix of vegetation variables I want to reduce, and run phylogenetic PCA:
vegMatrix <- as.matrix(Rat[,c(16,17,19,21,30,31)]) #veg variables were #16,#17 etc in dataframe
rownames(vegMatrix) <-Rat$Patch.trap #rownames need to match the names of the tips in myTree exactly ('LittleTutu-1' etc). Rat$Patch.trap is a vector of these names.
phyPCA <- phyl.pca(tree=myTree, Y=vegMatrix, mode="corr") #based on correlation, rather than covariance, matrix
phyPCA
phyPC1 <- phyPCA$S[,1] #extract first PCA axis
This approach produced axes that were correlated with my response variable (rat capture rates) but which were uncorrelated with each other when nesting of sites within patches was taken into account. Note that I haven't been able to make this approach work when the data contains 'singleton' patches (i.e. patches with only one site). Note also that the 'read.tree' command removes whitespace from the tree - so best not to have spaces in patch or site names.
Lastly, when I did this phylogenetic PCA it was for a specific application which required uncorrelated axes after nesting was accounted for. Whether this approach should be done any time a PCA is done on nested data is another question (which I'd be interested to hear people's thoughts on).
Best Answer
You are going to have issues with performing
PCA
with such amounts of missing data. The analysis is either going to exclude cases with missing values, or some type of imputation will need to be performed. Say you have 100 rows, withVar3
which has only 20 rows (i.e. only 20% of rows have values) you may have issues with accurate imputation, which of course will then create issues for the validity of thePCA
results.