Solved – Does the curse of dimensionality also affect principal component analysis calculations?

dimensionality reduction, pca

Based on this post, the Big-O notation for the complexity of calculating principal components analysis is $O(p^2n+p^3)$ for a dataset of size $n$ with $p$ features.
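To see where those two terms come from, here is a minimal NumPy sketch of covariance-based PCA (the sizes are illustrative, and this is only one of several ways to compute PCA):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 50                    # n samples, p features (illustrative sizes)
X = rng.normal(size=(n, p))

# Center the data, then form the p x p covariance matrix: O(p^2 n)
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n - 1)

# Eigendecomposition of the covariance matrix: O(p^3)
eigvals, eigvecs = np.linalg.eigh(cov)

# Principal components are the eigenvectors, sorted by descending eigenvalue
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# Project the centered data onto the components
scores = Xc @ components
```

In practice, libraries often use a (possibly truncated) SVD of the data matrix instead of forming the covariance explicitly, which can be cheaper when only a few components are needed.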

I understand that PCA is often performed for dimension reduction. However, given the seemingly high computational cost of performing PCA, wouldn't the PCA itself be too expensive to compute in many cases?

For example, if I can't fit a given linear regression or machine learning model to a large training dataset because it has far too many features, wouldn't there be too many features to perform PCA as well?

Best Answer

The "curse of dimensionality" does not usually refer to the increased computation time required for datasets with many features. While dimensionality reduction methods do have the added benefit of reducing the number of features, and consequently the computation time, that is oftentimes not the primary goal.

Rather, the "curse of dimensionality" refers to the exponential growth of the sample space with the number of dimensions, and the resulting sparsity of the data that occupy that space. In a high-dimensional space, even a reasonably large dataset will inadequately cover the sample space, and nearly all points will look "far away" from any query point, making distance metrics less meaningful. The biggest benefit of dimensionality reduction isn't to save time; it's to make the downstream algorithm work better.
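A small simulation (with made-up uniform data) illustrates this distance-concentration effect: as the dimension grows, the nearest and farthest points from a query become nearly the same distance away.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # points per dataset

ratios = {}
for p in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, p))        # n random points in the unit cube
    q = rng.uniform(size=p)             # a random query point
    d = np.linalg.norm(X - q, axis=1)   # Euclidean distances to the query
    ratios[p] = d.min() / d.max()       # nearest/farthest distance ratio
    print(f"p={p:4d}  nearest/farthest = {ratios[p]:.3f}")
```

The ratio climbs toward 1 as p increases, which is exactly why nearest-neighbor-style methods degrade in high dimensions.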

Big O notation describes how an algorithm's running time scales with the size of its input, but it is not a good tool for comparing the actual running times of different methods. An O(n) algorithm can take longer to run than an O(n^2) algorithm if each of its O(n) operations is expensive. With large enough n, the O(n^2) method will eventually be slower, since its operation count grows much faster, but without far more detail we can't know where that crossover occurs. The fact that PCA has a higher Big O complexity than logistic regression is not in itself an indication that the PCA algorithm will actually take longer to run. All it says is that this will eventually be true when n is large enough, and we have no indication of how large that n might need to be.
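A toy cost model makes the crossover concrete (the constant factors here are invented for illustration, not measured from any real algorithm): an O(n) method paying 1000 units per operation loses to an O(n^2) method paying 1 unit per operation until n passes 1000.

```python
# Hypothetical per-operation costs: the O(n) method does expensive
# operations, the O(n^2) method does cheap ones.
def cost_linear(n, per_op=1000):
    return per_op * n

def cost_quadratic(n, per_op=1):
    return per_op * n * n

for n in (10, 100, 1000, 10_000):
    print(f"n={n:6d}  O(n): {cost_linear(n):12d}  O(n^2): {cost_quadratic(n):12d}")
```

Below the crossover at n = 1000 the "worse" asymptotic class is actually cheaper; only beyond it does the Big O ordering match the observed cost.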