To find PCs in classical PCA, one can perform singular value decomposition of the centred data matrix (with variables in columns) $\mathbf X = \mathbf U \mathbf S \mathbf V^\top$; the columns of $\mathbf U \mathbf S$ are called principal components (i.e. the projections of the original data onto the eigenvectors of the covariance matrix). Observe that the so-called Gram matrix $\mathbf G = \mathbf X \mathbf X^\top = \mathbf U \mathbf S^2 \mathbf U^\top$ has eigenvectors $\mathbf U$ and eigenvalues $\mathbf S^2$, so another way to compute principal components is to scale the eigenvectors of the Gram matrix by the square roots of the respective eigenvalues.
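To make the equivalence concrete, here is a minimal NumPy sketch (the random data are chosen purely for illustration) that computes the same principal components both ways:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)              # centre the data (variables in columns)

# Route 1: SVD of the centred data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_svd = U * S                      # columns of U S are the principal components

# Route 2: eigendecomposition of the Gram matrix G = X X^T.
G = Xc @ Xc.T
evals, evecs = np.linalg.eigh(G)     # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1][:5]  # keep the top 5 (the rest are ~0 here)
pcs_gram = evecs[:, order] * np.sqrt(evals[order])

# The two routes agree up to the arbitrary sign of each component.
print(np.allclose(np.abs(pcs_svd), np.abs(pcs_gram)))  # True
```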
In full analogy, here is a complete algorithm to compute kernel principal components:
1. Choose a kernel function $k(\mathbf x, \mathbf y)$ that conceptually is a scalar product in the target space.
2. Compute the Gram/kernel matrix $\mathbf K$ with $K_{ij} = k(\mathbf x_{(i)}, \mathbf x_{(j)})$.
3. Center the kernel matrix via the following trick: $$\mathbf K_\mathrm{centered} = \mathbf K - \mathbf 1_n \mathbf K - \mathbf K \mathbf 1_n + \mathbf 1_n \mathbf K \mathbf 1_n = (\mathbf I - \mathbf 1_n)\mathbf K(\mathbf I - \mathbf 1_n),$$ where $\mathbf 1_n$ is an $n \times n$ matrix with all elements equal to $\frac{1}{n}$, and $n$ is the number of data points.
4. Find eigenvectors $\mathbf U$ and eigenvalues $\mathbf S^2$ of the centered kernel matrix. Multiply each eigenvector by the square root of the respective eigenvalue.
5. Done. These are the kernel principal components (see the sketch after this list).
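A minimal NumPy sketch of these five steps (the Gaussian kernel and its bandwidth are arbitrary choices made for illustration):

```python
import numpy as np

def kernel_pca(X, kernel, n_components):
    n = X.shape[0]
    # Step 2: Gram/kernel matrix.
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Step 3: centering trick, (I - 1_n) K (I - 1_n).
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Step 4: eigendecomposition; scale eigenvectors by sqrt(eigenvalues).
    evals, evecs = np.linalg.eigh(Kc)            # ascending order
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order] * np.sqrt(evals[order])

# Usage with a Gaussian (RBF) kernel, bandwidth 1:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
rbf = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))
Z = kernel_pca(X, rbf, n_components=2)           # 50 x 2 kernel PCs
```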
Answering your question specifically: I don't see any need to scale either the eigenvectors or the eigenvalues by $n$ in steps 4-5.
A good reference is the original paper: Schölkopf B, Smola A, and Müller KR, Kernel principal component analysis, 1999. Note that it presents the same algorithm in a somewhat more complicated way: you are supposed to find the eigenvectors of $\mathbf K$ and then multiply them by $\mathbf K$ (as you wrote in your question). But multiplying a matrix by its eigenvector yields the same eigenvector scaled by the corresponding eigenvalue (by definition).
Indeed, there is no guarantee that top principal components (PCs) have more predictive power than the low-variance ones.
Real-world examples can be found where this is not the case, and it is easy to construct an artificial example where e.g. only the smallest PC has any relation to $y$ at all.
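For instance, here is a minimal NumPy construction (purely artificial, for illustration) in which the first PC carries almost all the variance while the response is driven entirely by the second, tiny one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z_big = rng.normal(scale=10.0, size=n)    # high-variance latent direction
z_small = rng.normal(scale=0.1, size=n)   # low-variance latent direction
theta = np.pi / 4                         # rotate into two observed predictors
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = np.column_stack([z_big, z_small]) @ R.T
y = z_small                               # y depends only on the small direction

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U * S
for i in range(2):
    r = np.corrcoef(pcs[:, i], y)[0, 1]
    print(f"PC{i + 1}: variance {S[i] ** 2 / (n - 1):.2f}, |corr with y| = {abs(r):.2f}")
# PC1 carries ~10000x the variance but has ~zero correlation with y;
# PC2 predicts y (nearly) perfectly.
```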
This topic was discussed a lot on our forum, and in the (unfortunate) absence of one clearly canonical thread, I can only give several links that together provide various real-life as well as artificial examples:
And the same topic, but in the context of classification:
However, in practice, top PCs often do have more predictive power than the low-variance ones; moreover, using only the top PCs can yield better predictive power than using all PCs.
In situations with many predictors $p$ and relatively few data points $n$ (e.g. when $p \approx n$ or even $p > n$), ordinary regression will overfit and needs to be regularized. Principal component regression (PCR) can be seen as one way to regularize the regression and will tend to give superior results. Moreover, it is closely related to ridge regression, which is a standard way of shrinkage regularization. While using ridge regression is usually a better idea, PCR will often behave reasonably well. See Why does shrinkage work? for a general discussion of the bias-variance tradeoff and of how shrinkage can be beneficial.
In a way, one can say that both ridge regression and PCR assume that most information about $y$ is contained in the large PCs of $X$, and this assumption is often warranted.
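To make this concrete, here is a minimal scikit-learn sketch; the low-rank data-generating process, where a few high-variance latent factors drive $y$, is an assumption made purely for illustration. On data generated this way, OLS overfits badly while both PCR with a handful of components and ridge do well:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n, p, k = 60, 50, 3                     # p close to n: plain OLS overfits badly
F = rng.normal(size=(n, k))             # a few dominant latent factors
W = rng.normal(size=(k, p)) * 3.0       # loadings: factors carry most variance
X = F @ W + rng.normal(size=(n, p))     # observed predictors
y = F @ rng.normal(size=k) + 0.5 * rng.normal(size=n)  # y driven by the factors

models = {
    "OLS": LinearRegression(),
    "PCR (3 components)": make_pipeline(PCA(n_components=k), LinearRegression()),
    "ridge": RidgeCV(alphas=np.logspace(-2, 4, 30)),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: cross-validated R^2 = {r2:.2f}")
```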
See the later answer by @cbeleites (+1) for some discussion about why this assumption is often warranted (and also this newer thread: Is dimensionality reduction almost always useful for classification? for some further comments).
Hastie et al. in The Elements of Statistical Learning (section 3.4.1) comment on this in the context of ridge regression:
[T]he small singular values [...] correspond to directions in the column space of $\mathbf X$ having small variance, and ridge regression shrinks these directions the most. [...] Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.
See my answers in the following threads for details:
Bottom line
For high-dimensional problems, pre-processing with PCA (meaning reducing dimensionality and keeping only top PCs) can be seen as one way of regularization and will often improve the results of any subsequent analysis, be it a regression or a classification method. But there is no guarantee that this will work, and there are often better regularization approaches.
Best Answer
PCA does not get rid of any of the variables, although some may be very unimportant. Suppose you use the first three components from your PCA: each of these is a linear combination of the 20 variables you put into it.
If you want to be able to interpret the importance of the original variables in the regression, I don't think PCA is the way to go. I would consider one of the penalized regression methods, such as the LASSO or LAR (least angle regression).
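For instance, a minimal scikit-learn sketch of the LASSO route (the synthetic data stand in for the 20 original predictors and are an assumption made for illustration):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)  # only two variables matter

lasso = LassoCV(cv=5).fit(X, y)
# Coefficients stay attached to the original variables, so importance remains
# directly interpretable, and most coefficients are shrunk exactly to zero.
print(np.nonzero(lasso.coef_)[0])   # indices of the selected variables
```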