Using PCA for feature selection (removing non-predictive features) is an extremely expensive way to do it: PCA algorithms are often $O(n^3)$. A much better and more efficient approach is to use a measure of dependence between each feature and the class. Mutual Information tends to perform very well for this; furthermore, it is the only measure of dependence that (a) fully generalizes and (b) actually has a sound philosophical foundation, being based on the Kullback-Leibler divergence.
For example, we compute (using a maximum-likelihood probability estimate with some smoothing)

$$\text{MI-above-expected} = \text{MI}(F, C) - E_{X, N}[\text{MI}(X, C)],$$

where the second term is the "expected mutual information given $N$ examples". We then take the top $M$ features after sorting by MI-above-expected.
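To make this concrete, here is a minimal R sketch under some assumptions of mine: the features are discrete, MI is estimated from an add-$\alpha$-smoothed contingency table, and the expected-MI term is approximated by permuting the feature to break any real dependence on the class (one simple choice among several). All function names here are mine, not from any package.

mi <- function(f, cl, alpha = 0.5) {
  tab <- table(f, cl) + alpha            # add-alpha smoothed counts
  p   <- tab / sum(tab)                  # smoothed joint probabilities
  sum(p * log(p / outer(rowSums(p), colSums(p))))  # KL(joint || product of marginals)
}

expected_mi <- function(f, cl, reps = 30) {
  mean(replicate(reps, mi(sample(f), cl)))  # MI of a permuted, non-informative feature
}

top_features <- function(features, class, M) {
  scores <- sapply(features, function(f) mi(f, class) - expected_mi(f, class))
  names(sort(scores, decreasing = TRUE))[seq_len(min(M, length(scores)))]
}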
The reason one would still want to use PCA is if one expects that many of the features are in fact dependent on each other. That would be particularly handy for Naive Bayes, where independence is assumed. Now, the datasets I've worked with have always been far too large to use PCA, so I don't use PCA and instead rely on more sophisticated methods. But if your dataset is small, and you don't have the time to investigate more sophisticated methods, then by all means go ahead and apply an out-of-the-box PCA.
Say your predictor matrix is $X$ and your response vector is $y$. PCA is concerned only with the (co)variance within the predictor matrix $X$ itself, while a regression model is (also) concerned with the covariance between $X$ and the response $y$. If the directions of high variance within $X$ are unrelated to the response, dimension reduction by PCA can be harmful to your regression, by screening out exactly those predictors in $X$ that are correlated with the response.
Here is a simple example in R, as I don't have access to matlab. Suppose I create some random Gaussian data:
x_1 <- rnorm(10000, mean = 0, sd = 1)   # high-variance predictor
x_2 <- rnorm(10000, mean = 0, sd = .1)  # low-variance predictor
X <- cbind(x_1, x_2)
And set up a situation where the response is correlated with only the smaller-variance component:
y <- x_2 + 1
The principal components of $X$ are just $x_1$ and $x_2$ themselves, given the way I set it up; the two are essentially uncorrelated, so the axes of maximal variance are the coordinate axes:
> cor(X)
             x_1          x_2
x_1 1.0000000000 0.0004543833
x_2 0.0004543833 1.0000000000
If PCA is used to select one component, we get $x_1$, as this has the highest variance. Regressing $y$ on $x_1$ is useless:
> lm(y ~ x_1)

Call:
lm(formula = y ~ x_1)

Coefficients:
(Intercept)          x_1
  1.001e+00    4.544e-05
The coefficient of $x_1$ here is essentially zero; the model has no more predictive power than an intercept-only model. On the other hand, if I select the lower-variance component:
> lm(y ~ x_2)

Call:
lm(formula = y ~ x_2)

Coefficients:
(Intercept)          x_2
          1            1
I get back a much more predictive model.
Because PCA ignores the relationship of $y$ to $X$, there is no reason to believe that its dimension reduction is reasonable in the context of your regression problem.
On the other hand, using the relationship between $y$ and $X$ to do pre-regression dimension reduction and variable selection is a very good way to overfit your model to your training data. In some ways, PCA's ignorance of the relationship between $X$ and $y$ is a blessing: while it can be harmful in the way outlined above, it cannot overfit the relationship in your training data in the same way that peeking at $y$ can.
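To see that risk concretely, here is a small simulation of mine (the dimensions and the cutoff of five predictors are arbitrary choices): even when $X$ and $y$ are pure, unrelated noise, selecting the predictors most correlated with $y$ before fitting produces an optimistic in-sample fit.

set.seed(1)
X_noise <- matrix(rnorm(100 * 50), nrow = 100)  # 100 observations, 50 pure-noise predictors
y_noise <- rnorm(100)                           # response unrelated to X_noise
cors  <- abs(cor(X_noise, y_noise))             # peek at y to score each predictor
picks <- order(cors, decreasing = TRUE)[1:5]    # keep the five "best"
summary(lm(y_noise ~ X_noise[, picks]))$r.squared  # noticeably above zero in-sample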
As for your more practical question, the matlab documentation says
coeff = pca(X)
returns the principal component coefficients, also known as loadings, for the n-by-p data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is p-by-p. Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance. By default, pca centers the data and uses the singular value decomposition (SVD) algorithm.
This says to me that, to do PCA dimension reduction in matlab, you need to:
- Center the columns of your $X$ matrix.
- Select the first $N$ columns of the coeff matrix, where $N$ is the number of non-intercept regressors you want in your model.
- Create a new data matrix as (X - mean(X)) * coeff(:, 1:N) (in matlab, mean(X) gives the column means, so the subtraction centers each column).
- Use the columns of the new matrix as regressors in your dimension-reduced regression.
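For what it's worth, here is the same recipe in R (where I can check the syntax), using prcomp on the $X$ and $y$ simulated above with $N = 1$; note that the single retained component is essentially $x_1$, so this reproduces the useless fit from before.

pc <- prcomp(X, center = TRUE)   # centers the columns and computes the loadings
N  <- 1
Z  <- pc$x[, 1:N, drop = FALSE]  # scores: the centered X times the first N loadings
lm(y ~ Z)                        # the dimension-reduced regression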
Again, I am far from fluent in matlab, so take the syntactic details of these steps with a grain of salt.
Best Answer
Note that ridge penalisation/regularisation is basically doing model selection using PCA, although it does so smoothly, by shrinking along each principal component axis, rather than discretely, by dropping small-variance PCs. Note also that, because you are doing PCA on different subsets of variables, this would roughly correspond to having a different regularising parameter for each group rather than one for all the betas. The Elements of Statistical Learning explains ridge regression from this angle quite well and provides some comparisons.
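Concretely, via the standard SVD identity: writing the centered $X$ as $UDV^T$, the ridge fit shrinks the contribution of the $j$-th principal axis by the factor $d_j^2/(d_j^2 + \lambda)$, whereas PCA regression uses factors of exactly 1 (kept) or 0 (dropped). A short R sketch, reusing the $X$ and $y$ from the answer above with an arbitrary $\lambda$:

Xc <- scale(X, center = TRUE, scale = FALSE)  # centered predictors
yc <- y - mean(y)                             # centered response
s  <- svd(Xc)
lambda <- 1
shrink <- s$d^2 / (s$d^2 + lambda)            # ridge's shrinkage factor along each PC axis
beta_ridge <- s$v %*% ((shrink / s$d) * (t(s$u) %*% yc))  # ridge coefficients via the SVD
# With lambda = 0 this reduces to OLS; as lambda grows, low-variance axes are shrunk hardest.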