Indeed, there is no guarantee that top principal components (PCs) have more predictive power than the low-variance ones.
Real-world examples can be found where this is not the case, and it is easy to construct an artificial example where e.g. only the smallest PC has any relation to $y$ at all.
This topic has been discussed a lot on our forum, and in the (unfortunate) absence of one clearly canonical thread, I can only give several links that together provide various real-life as well as artificial examples:
And the same topic, but in the context of classification:
However, in practice, top PCs often do have more predictive power than the low-variance ones, and moreover, using only the top PCs can yield better predictive power than using all PCs.
In situations with many predictors $p$ and relatively few data points $n$ (e.g. when $p \approx n$ or even $p > n$), ordinary regression will overfit and needs to be regularized. Principal component regression (PCR) can be seen as one way to regularize the regression and will tend to give better results than unregularized regression. Moreover, it is closely related to ridge regression, which is a standard form of shrinkage regularization. While using ridge regression is usually a better idea, PCR will often behave reasonably well. See Why does shrinkage work? for a general discussion of the bias-variance tradeoff and of how shrinkage can be beneficial.
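Here is a minimal sketch of the two approaches on simulated data with $p > n$; the data and variable names are purely illustrative, and glmnet is assumed to be available for the ridge fit:

```r
library(glmnet)   # assumed installed, used only for the ridge fit

set.seed(1)
n <- 50; p <- 200                        # p > n: ordinary regression would overfit
X <- matrix(rnorm(n * p), n, p)
beta <- c(rnorm(10), rep(0, p - 10))
y <- drop(X %*% beta) + rnorm(n)

# PCR: keep only the top k principal components and regress y on them
k <- 10
pcs <- prcomp(X, center = TRUE)
Z <- pcs$x[, 1:k]
pcr_fit <- lm(y ~ Z)

# Ridge regression: shrinks along all PC directions, the low-variance ones the most
ridge_fit <- glmnet(X, y, alpha = 0)     # alpha = 0 gives the ridge penalty
```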
In a way, one can say that both ridge regression and PCR assume that most information about $y$ is contained in the large PCs of $X$, and this assumption is often warranted.
See the later answer by @cbeleites (+1) for some discussion about why this assumption is often warranted (and also this newer thread: Is dimensionality reduction almost always useful for classification? for some further comments).
Hastie et al. in The Elements of Statistical Learning (section 3.4.1) comment on this in the context of ridge regression:
[T]he small singular values [...] correspond to directions in the column space of $\mathbf X$ having small variance, and ridge regression shrinks these directions the most. [...] Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.
See my answers in the following threads for details:
Bottom line
For high-dimensional problems, pre-processing with PCA (meaning reducing dimensionality and keeping only top PCs) can be seen as one way of regularization and will often improve the results of any subsequent analysis, be it a regression or a classification method. But there is no guarantee that this will work, and there are often better regularization approaches.
PCA is a linear transformation, and if you keep every dimension, your data should have the same distance function. Suppose your original data is some matrix $X$ and your resulting data is the transformed matrix $Y = AX$; then the distance should be unchanged:
$$ d(y_i, y_j) = d(A x_i, A x_j) $$
and if you are using the cosine distance,
$$ d(A x_i, A x_j) = 1 - ((A x_i)^T (A x_j)) / (\|A x_i\| \| A x_j \|) $$
and since $\| A x_i \| = \| x_i \|$,
$$ = 1 - (x_i^T A^T A x_j) / (\| x_i \| \| x_j \|) $$
and since $A^T A = I$,
$$ = 1 - (x_i^T x_j) / (\| x_i \| \| x_j \|) = d(x_i, x_j) $$
so all this was a very roundabout way to say that distance calculations should not be affected.
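A quick numerical check of this argument (the data here is simulated and the names are made up): when all components are kept, the PCA rotation is just an orthogonal map, so the cosine distance is unchanged.

```r
set.seed(1)
X <- matrix(rnorm(20 * 5), 20, 5)
A <- prcomp(X, center = FALSE)$rotation   # orthogonal: t(A) %*% A = I
Y <- X %*% A                              # rows of Y are the rotated points

cosine_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_dist(X[1, ], X[2, ])   # original distance
cosine_dist(Y[1, ], Y[2, ])   # identical up to floating-point rounding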
Edit, now that the code has appeared: in the calculation of centered, you are normalizing the features to unit variance. This might be a good idea in general, but it is going to change the distances between points, since essentially you are weighting some dimensions to be more or less important than others. In that case, your resulting data is some transformed matrix $Y = ANX$, where $A$ is some orthogonal basis (so $A^T A = I$ as before), but $N$ is a diagonal matrix that is not necessarily $I$. In that case you cannot show that $d(y_i, y_j) = d(x_i, x_j)$. However, you can show that $d(y_i, y_j) = d(N x_i, N x_j)$. That means that if you were to transform your input data to have unit variance in each dimension, then the distances would be the same as for the post-PCA data.
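A small sketch of that scaled case, again with made-up data: after per-feature standardization (the diagonal $N$ above), the post-PCA distances match those of the standardized inputs, but generally not those of the raw data.

```r
set.seed(1)
X  <- matrix(rnorm(20 * 5), 20, 5)
N  <- diag(1 / apply(X, 2, sd))           # diagonal scaling to unit variance
Xs <- X %*% N                             # the "N x" data in the notation above
A  <- prcomp(Xs, center = FALSE)$rotation
Y  <- Xs %*% A                            # post-PCA data

cosine_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_dist(Y[1, ], Y[2, ])    # equals the next line...
cosine_dist(Xs[1, ], Xs[2, ])  # ...but generally differs from the raw-data distance
cosine_dist(X[1, ], X[2, ])
```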
Best Answer
That's the nature of kernels: they take input in one space and map it into another, often higher-dimensional, space. In your case, it is easy to see that applying the polynomial kernel increases the dimensionality of the problem.
Now, I'm not sure which implementation of kernel PCA you are using, but reaching 95% of explained variance in a higher-dimensional space may indeed require that many features. Have you tried to test the procedure yourself?
If you are using kernlab::kpca, the function pcv returns the principal component vectors arranged columnwise. Ordinary PCA would return 5000 rows as well, and I would expect the same from other implementations. Check this:
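Along these lines, a small self-contained illustration with kernlab::kpca (the data here is simulated, not yours; set n to 5000 to mirror your case): pcv returns one row per observation, just like ordinary PCA.

```r
library(kernlab)

set.seed(1)
n <- 300                                   # use 5000 to match the question
X <- matrix(rnorm(n * 10), nrow = n)

# degree-2 polynomial kernel; features = 0 keeps all components above the threshold
kpc <- kpca(X, kernel = "polydot", kpar = list(degree = 2), features = 0)

dim(pcv(kpc))       # principal component vectors: n rows, arranged columnwise
dim(rotated(kpc))   # the projected data also has n rows
head(eig(kpc))      # eigenvalues, useful for an explained-variance check
```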