To find PCs in classical PCA, one can perform singular value decomposition of the centred data matrix (with variables in columns) $\mathbf X = \mathbf U \mathbf S \mathbf V^\top$; the columns of $\mathbf U \mathbf S$ are called principal components (i.e. the projections of the original data onto the eigenvectors of the covariance matrix). Observe that the so-called Gram matrix $\mathbf G = \mathbf X \mathbf X^\top = \mathbf U \mathbf S^2 \mathbf U^\top$ has eigenvectors $\mathbf U$ and eigenvalues $\mathbf S^2$, so another way to compute principal components is to scale the eigenvectors of the Gram matrix by the square roots of the respective eigenvalues.
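To make the equivalence concrete, here is a minimal NumPy sketch (the random data are chosen purely for illustration) that computes the same principal components both ways:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)              # centre the data (variables in columns)

# Route 1: SVD of the centred data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_svd = U * S                      # columns of U S are the principal components

# Route 2: eigendecomposition of the Gram matrix G = X X^T.
G = Xc @ Xc.T
evals, evecs = np.linalg.eigh(G)     # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1][:5]  # keep the top 5 (the rest are ~0 here)
pcs_gram = evecs[:, order] * np.sqrt(evals[order])

# The two routes agree up to the arbitrary sign of each component.
print(np.allclose(np.abs(pcs_svd), np.abs(pcs_gram)))  # True
```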
In full analogy, here is a complete algorithm to compute kernel principal components:
1. Choose a kernel function $k(\mathbf x, \mathbf y)$ that conceptually is a scalar product in the target space.
2. Compute the Gram/kernel matrix $\mathbf K$ with $K_{ij} = k(\mathbf x_{(i)}, \mathbf x_{(j)})$.
3. Center the kernel matrix via the following trick: $$\mathbf K_\mathrm{centered} = \mathbf K - \mathbf 1_n \mathbf K - \mathbf K \mathbf 1_n + \mathbf 1_n \mathbf K \mathbf 1_n = (\mathbf I - \mathbf 1_n)\mathbf K(\mathbf I - \mathbf 1_n),$$ where $\mathbf 1_n$ is an $n \times n$ matrix with all elements equal to $\frac{1}{n}$, and $n$ is the number of data points.
4. Find eigenvectors $\mathbf U$ and eigenvalues $\mathbf S^2$ of the centered kernel matrix. Multiply each eigenvector by the square root of the respective eigenvalue.
5. Done. These are the kernel principal components (see the sketch after this list).
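A minimal NumPy sketch of these five steps (the Gaussian kernel and its bandwidth are arbitrary choices made for illustration):

```python
import numpy as np

def kernel_pca(X, kernel, n_components):
    n = X.shape[0]
    # Step 2: Gram/kernel matrix.
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Step 3: centering trick, (I - 1_n) K (I - 1_n).
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Step 4: eigendecomposition; scale eigenvectors by sqrt(eigenvalues).
    evals, evecs = np.linalg.eigh(Kc)            # ascending order
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order] * np.sqrt(evals[order])

# Usage with a Gaussian (RBF) kernel, bandwidth 1:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
rbf = lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2))
Z = kernel_pca(X, rbf, n_components=2)           # 50 x 2 kernel PCs
```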
Answering your question specifically: I don't see any need to scale either the eigenvectors or the eigenvalues by $n$ in steps 4-5.
A good reference is the original paper: Schölkopf B, Smola A, and Müller KR, Kernel principal component analysis, 1999. Note that it presents the same algorithm in a somewhat more complicated way: you are supposed to find the eigenvectors of $\mathbf K$ and then multiply them by $\mathbf K$ (as you wrote in your question). But multiplying a matrix by its eigenvector yields the same eigenvector scaled by the corresponding eigenvalue (by definition).
Indeed, there is no guarantee that top principal components (PCs) have more predictive power than the low-variance ones.
Real-world examples can be found where this is not the case, and it is easy to construct an artificial example where e.g. only the smallest PC has any relation to $y$ at all.
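For instance, here is a minimal NumPy construction (purely artificial, for illustration) in which the first PC carries almost all the variance while the response is driven entirely by the second, tiny one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z_big = rng.normal(scale=10.0, size=n)    # high-variance latent direction
z_small = rng.normal(scale=0.1, size=n)   # low-variance latent direction
theta = np.pi / 4                         # rotate into two observed predictors
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = np.column_stack([z_big, z_small]) @ R.T
y = z_small                               # y depends only on the small direction

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U * S
for i in range(2):
    r = np.corrcoef(pcs[:, i], y)[0, 1]
    print(f"PC{i + 1}: variance {S[i] ** 2 / (n - 1):.2f}, |corr with y| = {abs(r):.2f}")
# PC1 carries ~10000x the variance but has ~zero correlation with y;
# PC2 predicts y (nearly) perfectly.
```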
This topic was discussed a lot on our forum, and in the (unfortunate) absence of one clearly canonical thread, I can only give several links that together provide various real-life as well as artificial examples:
And the same topic, but in the context of classification:
However, in practice, top PCs often do have more predictive power than the low-variance ones; moreover, using only the top PCs can yield better predictive power than using all PCs.
In situations with many predictors $p$ and relatively few data points $n$ (e.g. when $p \approx n$ or even $p > n$), ordinary regression will overfit and needs to be regularized. Principal component regression (PCR) can be seen as one way to regularize the regression and will tend to give superior results. Moreover, it is closely related to ridge regression, which is a standard way of shrinkage regularization. While using ridge regression is usually a better idea, PCR will often behave reasonably well. See Why does shrinkage work? for a general discussion of the bias-variance tradeoff and of how shrinkage can be beneficial.
In a way, one can say that both ridge regression and PCR assume that most information about $y$ is contained in the large PCs of $X$, and this assumption is often warranted.
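To make this concrete, here is a minimal scikit-learn sketch; the low-rank data-generating process, where a few high-variance latent factors drive $y$, is an assumption made purely for illustration. On data generated this way, OLS overfits badly while both PCR with a handful of components and ridge do well:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n, p, k = 60, 50, 3                     # p close to n: plain OLS overfits badly
F = rng.normal(size=(n, k))             # a few dominant latent factors
W = rng.normal(size=(k, p)) * 3.0       # loadings: factors carry most variance
X = F @ W + rng.normal(size=(n, p))     # observed predictors
y = F @ rng.normal(size=k) + 0.5 * rng.normal(size=n)  # y driven by the factors

models = {
    "OLS": LinearRegression(),
    "PCR (3 components)": make_pipeline(PCA(n_components=k), LinearRegression()),
    "ridge": RidgeCV(alphas=np.logspace(-2, 4, 30)),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: cross-validated R^2 = {r2:.2f}")
```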
See the later answer by @cbeleites (+1) for some discussion about why this assumption is often warranted (and also this newer thread: Is dimensionality reduction almost always useful for classification? for some further comments).
Hastie et al. in The Elements of Statistical Learning (section 3.4.1) comment on this in the context of ridge regression:
[T]he small singular values [...] correspond to directions in the column space of $\mathbf X$ having small variance, and ridge regression shrinks these directions the most. [...] Ridge regression protects against the potentially high variance of gradients estimated in the short directions. The implicit assumption is that the response will tend to vary most in the directions of high variance of the inputs. This is often a reasonable assumption, since predictors are often chosen for study because they vary with the response variable, but need not hold in general.
See my answers in the following threads for details:
Bottom line
For high-dimensional problems, pre-processing with PCA (meaning reducing dimensionality and keeping only top PCs) can be seen as one way of regularization and will often improve the results of any subsequent analysis, be it a regression or a classification method. But there is no guarantee that this will work, and there are often better regularization approaches.
Best Answer
PCA does not get rid of any of the variables, although some may be very unimportant. Suppose you use the first three components from your PCA: each of these is a linear combination of the 20 variables you put into it.
If you want to be able to interpret the importance of the original variables in the regression, I don't think PCA is the way to go. I would consider one of the penalized regression methods, such as the LASSO or LAR (least angle regression).
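For instance, a minimal scikit-learn sketch of the LASSO route (the synthetic data stand in for the 20 original predictors and are an assumption made for illustration):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)  # only two variables matter

lasso = LassoCV(cv=5).fit(X, y)
# Coefficients stay attached to the original variables, so importance remains
# directly interpretable, and most coefficients are shrunk exactly to zero.
print(np.nonzero(lasso.coef_)[0])   # indices of the selected variables
```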