Using PCA for feature selection (removing non-predictive features) is an extremely expensive way to do it: PCA algorithms are often $O(n^3)$. A much better and more efficient approach is to use a measure of dependence between each feature and the class. Mutual Information tends to perform very well for this; furthermore, it is the only measure of dependence that (a) fully generalizes and (b) actually has a sound philosophical foundation, being based on the Kullback-Leibler divergence.
For example, we compute (using a maximum-likelihood probability estimate with some smoothing)

$$\text{MI-above-expected} = \text{MI}(F, C) - E_{X, N}[\text{MI}(X, C)],$$

where the second term is the "expected mutual information given $N$ examples". We then take the top $M$ features after sorting by MI-above-expected.
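To make this concrete, here is a minimal R sketch under some assumptions of mine: the features are discrete, MI is estimated from an add-$\alpha$-smoothed contingency table, and the expected-MI term is approximated by permuting the feature to break any real dependence on the class (one simple choice among several). All function names here are mine, not from any package.

mi <- function(f, cl, alpha = 0.5) {
  tab <- table(f, cl) + alpha            # add-alpha smoothed counts
  p   <- tab / sum(tab)                  # smoothed joint probabilities
  sum(p * log(p / outer(rowSums(p), colSums(p))))  # KL(joint || product of marginals)
}

expected_mi <- function(f, cl, reps = 30) {
  mean(replicate(reps, mi(sample(f), cl)))  # MI of a permuted, non-informative feature
}

top_features <- function(features, class, M) {
  scores <- sapply(features, function(f) mi(f, class) - expected_mi(f, class))
  names(sort(scores, decreasing = TRUE))[seq_len(min(M, length(scores)))]
}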
The reason one would still want to use PCA is if one expects that many of the features are in fact dependent on each other. That would be particularly handy for Naive Bayes, where independence is assumed. Now, the datasets I've worked with have always been far too large to use PCA, so I don't use PCA and instead rely on more sophisticated methods. But if your dataset is small, and you don't have the time to investigate more sophisticated methods, then by all means go ahead and apply an out-of-the-box PCA.
Say your predictor matrix is $X$ and your response vector is $y$. PCA is concerned only with the (co)variance within the predictor matrix $X$ itself, while a regression model is (also) concerned with the covariance between $X$ and the response $y$. If the directions of high variance within $X$ are unrelated to the response, dimension reduction by PCA can be harmful to your regression, by screening out exactly those predictors in $X$ that are correlated with the response.
Here is a simple example in R, as I don't have access to matlab. Suppose I create some random Gaussian data:
x_1 <- rnorm(10000, mean = 0, sd = 1)   # high-variance predictor
x_2 <- rnorm(10000, mean = 0, sd = .1)  # low-variance predictor
X <- cbind(x_1, x_2)
And set up a situation where the response is correlated with only the smaller-variance component:
y <- x_2 + 1
The principal components of $X$ are just $x_1$ and $x_2$ themselves, given the way I set it up; the two are essentially uncorrelated, so the axes of maximal variance are the coordinate axes:
> cor(X)
             x_1          x_2
x_1 1.0000000000 0.0004543833
x_2 0.0004543833 1.0000000000
If PCA is used to select one component, we get $x_1$, as this has the highest variance. Regressing $y$ on $x_1$ is useless:
> lm(y ~ x_1)

Call:
lm(formula = y ~ x_1)

Coefficients:
(Intercept)          x_1
  1.001e+00    4.544e-05
The coefficient of $x_1$ here is essentially zero; the model has no more predictive power than an intercept-only model. On the other hand, if I select the lower-variance component:
> lm(y ~ x_2)

Call:
lm(formula = y ~ x_2)

Coefficients:
(Intercept)          x_2
          1            1
I get back a much more predictive model.
Because PCA ignores the relationship of $y$ to $X$, there is no reason to believe that its dimension reduction is reasonable in the context of your regression problem.
On the other hand, using the relationship between $y$ and $X$ to do pre-regression dimension reduction and variable selection is a very good way to overfit your model to your training data. In some ways, PCA's ignorance of the relationship between $X$ and $y$ is a blessing: while it can be harmful in the way outlined above, it cannot overfit the relationship in your training data in the same way that peeking at $y$ can.
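To see that risk concretely, here is a small simulation of mine (the dimensions and the cutoff of five predictors are arbitrary choices): even when $X$ and $y$ are pure, unrelated noise, selecting the predictors most correlated with $y$ before fitting produces an optimistic in-sample fit.

set.seed(1)
X_noise <- matrix(rnorm(100 * 50), nrow = 100)  # 100 observations, 50 pure-noise predictors
y_noise <- rnorm(100)                           # response unrelated to X_noise
cors  <- abs(cor(X_noise, y_noise))             # peek at y to score each predictor
picks <- order(cors, decreasing = TRUE)[1:5]    # keep the five "best"
summary(lm(y_noise ~ X_noise[, picks]))$r.squared  # noticeably above zero in-sample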
As for your more practical question, the matlab documentation says
coeff = pca(X)
returns the principal component coefficients, also known as loadings, for the n-by-p data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is p-by-p. Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance. By default, pca centers the data and uses the singular value decomposition (SVD) algorithm.
This says to me that, to do PCA dimension reduction in matlab, you need to:
- Center the columns of your $X$ matrix.
- Select the first $N$ columns of the coeff matrix, where $N$ is the number of non-intercept regressors you want in your model.
- Create a new data matrix as (X - mean(X)) * coeff(:, 1:N) (in matlab, mean(X) gives the column means, so the subtraction centers each column).
- Use the columns of the new matrix as regressors in your dimension-reduced regression.
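For what it's worth, here is the same recipe in R (where I can check the syntax), using prcomp on the $X$ and $y$ simulated above with $N = 1$; note that the single retained component is essentially $x_1$, so this reproduces the useless fit from before.

pc <- prcomp(X, center = TRUE)   # centers the columns and computes the loadings
N  <- 1
Z  <- pc$x[, 1:N, drop = FALSE]  # scores: the centered X times the first N loadings
lm(y ~ Z)                        # the dimension-reduced regression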
Again, I am far from fluent in matlab, so take the syntactic details of these steps with a grain of salt.
Best Answer
Note that ridge penalisation/regularisation is basically doing model selection using PCA, although it does so smoothly, by shrinking along each principal component axis, rather than discretely, by dropping small-variance PCs. Note also that, because you are doing PCA on different subsets of variables, this would roughly correspond to having a different regularising parameter for each group rather than one for all the betas. The Elements of Statistical Learning explains ridge regression from this angle quite well and provides some comparisons.
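Concretely, via the standard SVD identity: writing the centered $X$ as $UDV^T$, the ridge fit shrinks the contribution of the $j$-th principal axis by the factor $d_j^2/(d_j^2 + \lambda)$, whereas PCA regression uses factors of exactly 1 (kept) or 0 (dropped). A short R sketch, reusing the $X$ and $y$ from the answer above with an arbitrary $\lambda$:

Xc <- scale(X, center = TRUE, scale = FALSE)  # centered predictors
yc <- y - mean(y)                             # centered response
s  <- svd(Xc)
lambda <- 1
shrink <- s$d^2 / (s$d^2 + lambda)            # ridge's shrinkage factor along each PC axis
beta_ridge <- s$v %*% ((shrink / s$d) * (t(s$u) %*% yc))  # ridge coefficients via the SVD
# With lambda = 0 this reduces to OLS; as lambda grows, low-variance axes are shrunk hardest.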