Indeed, there is no guarantee that top principal components (PCs) have more predictive power than the low-variance ones.
Real-world examples can be found where this is not the case, and it is easy to construct an artificial example where e.g. only the smallest PC has any relation to $y$ at all.
This topic was discussed a lot on our forum, and in the (unfortunate) absence of one clearly canonical thread, I can only give several links that together provide various real-life as well as artificial examples:
And the same topic, but in the context of classification:
However, in practice, top PCs often do have more predictive power than the low-variance ones, and moreover, using only the top PCs can yield better predictive power than using all PCs.
In situations with a lot of predictors $p$ and relatively few data points $n$ (e.g. when $p \approx n$ or even $p>n$), ordinary regression will overfit and needs to be regularized. Principal component regression (PCR) can be seen as one way to regularize the regression and will tend to give better results than the unregularized fit. Moreover, it is closely related to ridge regression, which is a standard way of shrinkage regularization. While ridge regression is usually a better idea, PCR will often behave reasonably well. See Why does shrinkage work? for a general discussion of the bias-variance tradeoff and of how shrinkage can be beneficial.
In a way, one can say that both ridge regression and PCR assume that most information about $y$ is contained in the large PCs of $X$, and this assumption is often warranted.
See the later answer by @cbeleites (+1) for some discussion about why this assumption is often warranted (and also this newer thread: Is dimensionality reduction almost always useful for classification? for some further comments).
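As an illustration of PCR acting as regularization, here is a small self-contained simulation (entirely synthetic; it assumes, as above, that the signal lives in the high-variance directions of $X$):

set.seed(1)
n = 80; p = 50; k = 5
# columns with decreasing variance; the true signal sits in the high-variance ones
X = matrix(rnorm(n * p), n, p) %*% diag(seq(3, 0.1, length.out = p))
beta_true = c(rnorm(k), rep(0, p - k))
y = drop(X %*% beta_true) + rnorm(n)
train = 1:60; test = 61:80
pc = prcomp(X[train, ])                            # PCA on the training design
Ztr = pc$x[, 1:k]                                  # top-k PC scores
Zte = scale(X[test, ], pc$center, FALSE) %*% pc$rotation[, 1:k]
pcr_fit = lm.fit(cbind(1, Ztr), y[train])          # PCR: regress on top PCs only
ols_fit = lm.fit(cbind(1, X[train, ]), y[train])   # unregularized regression
mse = function(pred) mean((y[test] - pred)^2)
mse(cbind(1, Zte) %*% pcr_fit$coefficients)        # PCR: small test error
mse(cbind(1, X[test, ]) %*% ols_fit$coefficients)  # full OLS: overfits badly

With only 60 training points and 50 predictors, the unregularized fit has huge variance, while keeping the top 5 PCs discards mostly noise.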
Hastie et al. in The Elements of Statistical Learning (section 3.4.1) comment on this in the context of ridge regression:
[T]he small singular values [...] correspond to directions in the column space of $\mathbf X$ having small variance, and ridge regression shrinks these directions the most. [...] Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.
See my answers in the following threads for details:
Bottom line
For high-dimensional problems, pre-processing with PCA (meaning reducing dimensionality and keeping only top PCs) can be seen as one way of regularization and will often improve the results of any subsequent analysis, be it a regression or a classification method. But there is no guarantee that this will work, and there are often better regularization approaches.
Why does kernel PCA increase dimensionality compared to PCA?
That's the nature of kernels: they map the input into a different (often higher-dimensional) feature space. In your case, it's easy to see that applying the polynomial kernel increases the dimensionality of the problem.
Now, I'm not sure which implementation of kernel PCA you're using, but reaching 95% explained variance in a higher-dimensional space may indeed require that many features. Have you tried to test the procedure yourself?
library(kernlab)

data = as.matrix(iris[, -5])  # example data; substitute your own matrix
# linear ("vanilla") kernel: equivalent to ordinary PCA
kpcl = kpca(data, kernel = "vanilladot", kpar = list())
# polynomial kernel of degree 3: an implicitly higher-dimensional feature space
kpcp = kpca(data, kernel = "polydot", kpar = list(degree = 3))

# cumulative proportion of explained variance per component
cumsum(eig(kpcl)) / sum(eig(kpcl))
cumsum(eig(kpcp)) / sum(eig(kpcp))
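For instance, to count how many components each kernel needs to reach 95% explained variance (a quick check continuing the snippet above):

# index of the first component at which cumulative variance reaches 95%
which(cumsum(eig(kpcl)) / sum(eig(kpcl)) >= 0.95)[1]
which(cumsum(eig(kpcp)) / sum(eig(kpcp)) >= 0.95)[1]

This makes the dimensionality difference between the two kernels concrete.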
Why did kernel PCA give me 5000 eigenvectors?
If you are using kernlab::kpca, the function pcv returns the principal component vectors arranged columnwise. Ordinary PCA would return 5000 rows as well, and I would expect the same from other implementations. Check this:
#both are n_samples by n_components matrices
pcv(kpcl)
pcv(kpcp)
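To verify the shape claim, you can compare against base R's prcomp (continuing the snippet above; with 5000 samples both would have 5000 rows):

dim(pcv(kpcl))        # one row per sample
nrow(prcomp(data)$x)  # ordinary PCA scores: also one row per sample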
Best Answer
Blindly using PCA is a recipe for disaster. (As an aside, automatically applying any method is not a good idea, because what works in one context is not guaranteed to work in another. We can formalize this intuitive idea with the No Free Lunch theorem.)
It's easy enough to construct an example where the eigenvectors corresponding to the smallest eigenvalues are the most informative. If you discard these directions, you're discarding the most helpful information for your classification or regression problem, and a model that retained them would do better.
More concretely, suppose $A$ is our design matrix, with each column mean-centered. Then we can use the SVD $A = USV^\top$ to compute the PCA of $A$ (see: Relationship between SVD and PCA. How to use SVD to perform PCA?).
For an example in the case of a linear model, the PC scores are given by the factorization $$ AV = US $$
and we wish to predict some outcome $y$ as a linear combination of the PCs: $y = AV\beta + \epsilon$, where $\epsilon$ is some noise. Further, let's assume that this linear model is the correct model.
In general, the vector $\beta$ can be anything, just as in an ordinary OLS regression setting; but in any particular problem, it's possible that the only nonzero elements of $\beta$ are the ones corresponding to the smallest positive singular values. Whenever this is the case, using PCA to reduce the dimension of $AV$ by discarding the smallest singular values will also discard the only relevant predictors of $y$. In other words, even though we started out with the correct model, the truncated model is not correct because it omits the key variables.
In other words, PCA has a weakness in a supervised learning scenario because it is not "$y$-aware." Of course, in the cases where PCA is a helpful step, $\beta$ will have nonzero entries corresponding to the larger singular values.
I think this example is instructive because it shows that even in the special case that the model is linear, truncating $AV$ risks discarding information.
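Here is a minimal R sketch of that construction (synthetic data, chosen so that all of the signal sits in the smallest-variance direction):

set.seed(1)
n = 200; p = 5
# columns with strongly decreasing variance, mean-centered
A = scale(matrix(rnorm(n * p), n, p) %*% diag(c(10, 5, 2, 1, 0.1)), scale = FALSE)
sv = svd(A)
Z = sv$u %*% diag(sv$d)   # PC scores: AV = US
# the response depends only on the smallest-variance PC
y = 10 * Z[, p] + rnorm(n, sd = 0.1)
summary(lm(y ~ Z[, 1:(p - 1)]))$r.squared  # top 4 PCs: R^2 essentially zero
summary(lm(y ~ Z[, p]))$r.squared          # smallest PC alone: R^2 close to 1

Truncating to the top PCs here throws away exactly the column of $Z$ that carries the signal.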
Other common objections include:
PCA is a linear model, but the relationships among features may not have the form of a linear factorization. This implies that PCA will be a distortion.
PCA can be hard to interpret, because it tends to yield "dense" factorizations, where all features in $A$ have nonzero effect on each PC.
Here's another example: The first principal component does not separate classes, but other PCs do; how is that possible?
Some more examples can be found in this closely-related thread (thanks, @gung!): Examples of PCA where PCs with low variance are "useful"