Solved – Applying principal component analysis on variables of different dimensions

pca

Is it possible to perform principal component analysis on variables that have different number of rows of data?

For instance, if there were just three variables – Var1, Var2, Var3 – where Var1 has 100 rows of data, Var2 has 50 rows, and Var3 has 20 rows. Can PCA be performed on such a data set? If PCA is possible, do I simply pad the empty ones with zero or is some sort of transformation needed?

Had all variables been the same dimension (e.g. Var1, Var2, Var3 all has 100 rows of data), then there would be no issue. Unfortunately my data set is not so trivial.

Please advise. Thanks!

Best Answer

You are going to have issues with performing PCA with such amounts of missing data. The analysis is either going to exclude cases with missing values, or some type of imputation will need to be performed. Say you have 100 rows, with Var3 which has only 20 rows (i.e. only 20% of rows have values) you may have issues with accurate imputation, which of course will then create issues for the validity of the PCA results.

Related Solutions

PCA – Principal Components Analysis Creating Correlated Axes with Nested Data

user11852: your suggestion of doing a phylogenetic PCA worked beautifully thank you!

Interesting that the REvell paper and corresponding R package discuss this technique only in terms of correcting for relatedness when comparing traits among species. Seems to be much more widely applicable than this - ie works for nested data in general.

Edit: giving an example of using the approach, as asked for in @nan's comment below:

First, an explanation of the data. My dataframe is called 'Rat', and it consists of rat capture rates and vegetation variables measured in two forest patches, 'B86Grazed' and 'LittleTutu'. Variables were measured at multiple sites within each patch, and my individual sites were named 'LittleTutu-1', 'LittleTutu-2' etc. My aim was to create PCA axes from the vegetation variables that were uncorrelated once the nesting of sites within patches was taken into account. For example, modelling axis1 as a function of axis2 while specifying patch identity as a random factor would find no significant effect of axis2. [Note also that I wouldnt be fitting a random-effect with only two patches; I've simplified the dataset for the example].

This is the code I used:

library(ape) #creating phylogenetic trees
library(phytools) #phylogenetic PCA

#define tree using Newick format, specifying that sites are nested within patches, and assigning all sites a branch length =1 in this example (indicating all sites within a patch are equally related, and equally unrelated to all sites from the other patch):
cat("((B86Grazed-1:1,B86Grazed-2:1,B86Grazed-3:1,B86Grazed-4:1,B86Grazed-5:1)B86Grazed:1,(LittleTutu-1:1,LittleTutu-2:1,LittleTutu-3:1,LittleTutu-4:1,LittleTutu-5:1,LittleTutu-6:1)LittleTutu:1)Patches;", file = "phyloPCAtree.tre", sep = "\n")
myTree <- read.tree(file="phyloPCAtree.tre")

#check tree structure: should show sites nested within patches
plot.phylo(myTree, show.node.label=T)

#create matrix of vegetation variables I want to reduce, and run phylogenetic PCA:
vegMatrix <- as.matrix(Rat[,c(16,17,19,21,30,31)])  #veg variables were #16,#17 etc in dataframe 
rownames(vegMatrix) <-Rat$Patch.trap #rownames need to match the names of the tips in myTree exactly ('LittleTutu-1' etc). Rat$Patch.trap is a vector of these names.

phyPCA <- phyl.pca(tree=myTree, Y=vegMatrix, mode="corr")   #based on correlation, rather than covariance, matrix
phyPCA      
phyPC1 <- phyPCA$S[,1]  #extract first PCA axis

This approach produced axes that were correlated with my response variable (rat capture rates) but which were uncorrelated with each other when nesting of sites within patches was taken into account. Note that I haven't been able to make this approach work when the data contains 'singleton' patches (i.e. patches with only one site). Note also that the 'read.tree' command removes whitespace from the tree - so best not to have spaces in patch or site names.

Lastly, when I did this phylogenetic PCA it was for a specific application which required uncorrelated axes after nesting was accounted for. Whether this approach should be done any time a PCA is done on nested data is another question (which I'd be interested to hear people's thoughts on).

Solved – 100% of variance explained by one principal component

This means all your variables can be written as a linear transformation of a single one of them, which is a pretty extreme case of linear dependence.

(Why? Because all the principal components of a dataset together are always a basis for the vector space spanned by the original variables.)

Best Answer

Related Solutions

PCA – Principal Components Analysis Creating Correlated Axes with Nested Data

Solved – 100% of variance explained by one principal component

Related Question