user11852: Your suggestion of doing a phylogenetic PCA worked beautifully, thank you!
Interesting that the Revell paper and the corresponding R package discuss this technique only in terms of correcting for relatedness when comparing traits among species. It seems to be much more widely applicable than that - i.e. it works for nested data in general.
Edit: here is an example of using the approach, as requested in @nan's comment below:
First, an explanation of the data. My dataframe is called 'Rat', and it consists of rat capture rates and vegetation variables measured in two forest patches, 'B86Grazed' and 'LittleTutu'. Variables were measured at multiple sites within each patch, and my individual sites were named 'LittleTutu-1', 'LittleTutu-2', etc. My aim was to create PCA axes from the vegetation variables that were uncorrelated once the nesting of sites within patches was taken into account. For example, modelling axis1 as a function of axis2 while specifying patch identity as a random factor should find no significant effect of axis2 (a sketch of this check follows the code below). [Note also that I wouldn't fit a random effect with only two patches; I've simplified the dataset for the example.]
This is the code I used:
library(ape) #creating phylogenetic trees
library(phytools) #phylogenetic PCA
#define tree using Newick format, specifying that sites are nested within patches, and assigning all sites a branch length =1 in this example (indicating all sites within a patch are equally related, and equally unrelated to all sites from the other patch):
cat("((B86Grazed-1:1,B86Grazed-2:1,B86Grazed-3:1,B86Grazed-4:1,B86Grazed-5:1)B86Grazed:1,(LittleTutu-1:1,LittleTutu-2:1,LittleTutu-3:1,LittleTutu-4:1,LittleTutu-5:1,LittleTutu-6:1)LittleTutu:1)Patches;", file = "phyloPCAtree.tre", sep = "\n")
myTree <- read.tree(file="phyloPCAtree.tre")
#check tree structure: should show sites nested within patches
plot.phylo(myTree, show.node.label=TRUE)
#create matrix of vegetation variables I want to reduce, and run phylogenetic PCA:
vegMatrix <- as.matrix(Rat[,c(16,17,19,21,30,31)]) #veg variables are columns 16, 17, etc. of the dataframe
rownames(vegMatrix) <- Rat$Patch.trap #rownames need to match the names of the tips in myTree exactly ('LittleTutu-1' etc). Rat$Patch.trap is a vector of these names.
phyPCA <- phyl.pca(tree=myTree, Y=vegMatrix, mode="corr") #based on correlation, rather than covariance, matrix
phyPCA
phyPC1 <- phyPCA$S[,1] #extract first PCA axis
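As mentioned above, here is a minimal sketch of the check that the axes are uncorrelated once nesting is accounted for. This was not part of my original script: it assumes the lme4 package, a hypothetical 'Patch' column in the dataframe giving each site's patch identity, and enough patches to justify a random effect (unlike the simplified two-patch example above):
library(lme4) #mixed models for the nesting check
phyPC2 <- phyPCA$S[,2] #extract second PCA axis
checkData <- data.frame(axis1 = phyPC1, axis2 = phyPC2, Patch = Rat$Patch) #'Patch' column name is an assumption
axisCheck <- lmer(axis1 ~ axis2 + (1|Patch), data = checkData) #patch identity as a random factor
summary(axisCheck) #the axis2 coefficient should be close to zero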
This approach produced axes that were correlated with my response variable (rat capture rates) but uncorrelated with each other once the nesting of sites within patches was taken into account. Note that I haven't been able to make this approach work when the data contain 'singleton' patches (i.e. patches with only one site). Note also that the 'read.tree' command removes whitespace from the tree, so it's best not to have spaces in patch or site names.
Lastly, I did this phylogenetic PCA for a specific application that required uncorrelated axes after nesting was accounted for. Whether this approach should be used any time a PCA is run on nested data is another question (one I'd be interested to hear people's thoughts on).
Best Answer
The "curse of dimensionality" does not usually refer to the increased computation time required for datasets with many features. While dimensionality reduction methods do have the added benefit of reducing the number of features, and consequently the computation time, it's oftentimes not the primary goal. Rather, the "curse of dimensionality" refers to the exponential growth of the sample space with the number of dimensions, and the resulting sparsity of the data that occupy that space. In a high dimensional space, even a reasonably large dataset will inadequately cover the sample space, and nearly all points will look "far away" from any query point, making distance metrics less meaningful. The biggest benefit of dimensionality reduction isn't to save time, it's to make the downstream algorithm work better.
Big O notation describes the time complexity of an algorithm in terms of the number of inputs, but it isn't a good tool for actually comparing the running times of different methods. You could have an O(n) algorithm that takes longer to run than an O(n^2) algorithm, if each of those O(n) operations is expensive. For large enough n the O(n^2) method will eventually be slower, since its number of operations grows much faster, but we can't know where that crossover occurs without a lot more detail. The fact that PCA has a Big O complexity greater than that of logistic regression is not in itself an indication that the PCA algorithm will actually take longer to run. All it says is that this will eventually be true once n is large enough, with no indication of how large that n might need to be.
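To make the crossover point concrete, here is a toy sketch in R with hypothetical per-operation costs, chosen purely for illustration: if each O(n) operation costs 1000 units and each O(n^2) operation costs 1 unit, the O(n) method only wins once n exceeds 1000.
n <- c(10, 100, 1000, 10000)
cost_linear <- 1000 * n #O(n) algorithm with an expensive constant per operation
cost_quadratic <- n^2 #O(n^2) algorithm with a cheap constant per operation
data.frame(n, cost_linear, cost_quadratic, linear_wins = cost_linear < cost_quadratic)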