Principal component analysis (PCA) is used to reduce the dimensionality of a data set. Explanations of PCA say that the data are projected onto the directions of largest variance; is that the same as selecting only the most important features and ignoring the others?
Solved – Does PCA mean selecting most important features and ignoring the others
machine-learning, pca
Related Solutions
Your understanding is right. Have a look at this figure, which shows a few possible configurations of data points: http://shapeofdata.files.wordpress.com/2013/02/pca22.png
They look ellipsoidal. If you do what you've described above, i.e. compress the points along the direction in which they are spread the most (roughly the 45-degree line in the image), the points will end up lying roughly on a circle (a sphere in higher dimensions).
One reason to spherify the data is so that, when doing prediction, you can tell which coordinates are important. Say you wish to predict $y$ using $x_1$ and $x_2$, and you obtain coefficient values $\beta_1$ and $\beta_2$, i.e. $y\sim \beta_1 x_1+\beta_2x_2$. If $x_1$ and $x_2$ have the same variance, i.e. the points are roughly spherically distributed, and you find that $\beta_1=1$ while $\beta_2=10$, you can interpret this as saying that $x_2$ influences $y$ more than $x_1$ does. If their scales were not the same, however, and $x_1$ were spread 10 times more widely than $x_2$, you could get the same values of $\beta_1$ and $\beta_2$ even if both variables influenced $y$ roughly equally. To summarize, you "spherify" or "normalize" so that you can make inferences about a variable's importance from its coefficient.
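A minimal sketch of that point (the data and variable names are purely illustrative, not from the original answer): the same underlying relationship gives very different-looking coefficients unless the predictors are put on a common scale first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(scale=10.0, size=500)   # x1 is spread 10 times more widely than x2
x2 = rng.normal(scale=1.0, size=500)
y = 1 * x1 + 10 * x2 + rng.normal(size=500)   # both predictors contribute equally to y's variance

X = np.column_stack([x1, x2])
print(LinearRegression().fit(X, y).coef_)
# raw scale: roughly [1, 10], which looks like x2 matters far more

print(LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_)
# standardized ("spherified") predictors: roughly [10, 10], i.e. equal influence
```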
Summary: PCA can be performed before LDA to regularize the problem and avoid over-fitting.
Recall that LDA projections are computed via eigendecomposition of $\boldsymbol \Sigma_W^{-1} \boldsymbol \Sigma_B$, where $\boldsymbol \Sigma_W$ and $\boldsymbol \Sigma_B$ are the within- and between-class covariance matrices. If there are fewer than $N$ data points (where $N$ is the dimensionality of your space, i.e. the number of features/variables), then $\boldsymbol \Sigma_W$ will be singular and therefore cannot be inverted. In this case there is simply no way to perform LDA directly, but if one applies PCA first, it will work. @Aaron made this remark in the comments to his reply, and I agree with that (but disagree with his answer in general, as you will see now).
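A quick numerical sketch of why that happens (assumed setup, not from the original answer): with fewer samples than dimensions, the estimated covariance matrix is rank-deficient and cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 30, 100                            # 30 samples in a 100-dimensional space
X = rng.standard_normal((n, N))

Sigma = np.cov(X, rowvar=False)           # N x N covariance estimate
print(np.linalg.matrix_rank(Sigma))       # at most n - 1 = 29, far below N
# np.linalg.inv(Sigma) would fail here (or return numerically meaningless values)
```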
However, this is only part of the problem. The bigger picture is that LDA very easily tends to overfit the data. Note that the within-class covariance matrix gets inverted in the LDA computation; for high-dimensional matrices, inversion is a very sensitive operation that can only be done reliably if the estimate of $\boldsymbol \Sigma_W$ is really good. But in high dimensions $N \gg 1$ it is really difficult to obtain a precise estimate of $\boldsymbol \Sigma_W$, and in practice one often needs many more than $N$ data points before one can hope the estimate is good. Otherwise $\boldsymbol \Sigma_W$ will be nearly singular (i.e. some of its eigenvalues will be very small), and this causes over-fitting, i.e. near-perfect class separation on the training data with chance performance on the test data.
To tackle this issue, one needs to regularize the problem. One way to do it is to use PCA to reduce the dimensionality first. There are other, arguably better, ones, e.g. the regularized LDA (rLDA) method, which simply uses $(1-\lambda)\boldsymbol \Sigma_W + \lambda \boldsymbol I$ with a small $\lambda$ instead of $\boldsymbol \Sigma_W$ (this is called a shrinkage estimator), but doing PCA first is conceptually the simplest approach and often works just fine.
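For reference, here is a hedged sketch of both regularization routes using scikit-learn (the data `X, y` and the choice of 10 kept components are assumptions, not part of the original answer):

```python
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Route 1: PCA first, then LDA on the reduced representation
pca_lda = Pipeline([
    ("pca", PCA(n_components=10)),          # how many components to keep is a modelling choice
    ("lda", LinearDiscriminantAnalysis()),
])

# Route 2: regularized LDA via a shrinkage estimator of the covariance,
# in the spirit of (1 - lambda) * Sigma_W + lambda * I described above
# (scikit-learn's exact shrinkage target differs slightly from the identity)
rlda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

# Both estimators can then be fitted with, e.g., pca_lda.fit(X, y) or rlda.fit(X, y).
```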
Illustration
Here is an illustration of the over-fitting problem. I generated 60 samples per class in 3 classes from a standard Gaussian distribution (zero mean, unit variance) in 10-, 50-, 100-, and 150-dimensional spaces, and applied LDA to project the data onto 2D:
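The original figure code is not given in the answer, but a rough reconstruction of the simulation might look like this (the random seed, plotting layout, and other details are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)
dims = [10, 50, 100, 150]
fig, axes = plt.subplots(1, len(dims), figsize=(16, 4))

for ax, d in zip(axes, dims):
    X = rng.standard_normal((180, d))                 # 3 classes x 60 samples, no real class signal
    y = np.repeat([0, 1, 2], 60)
    Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=10)
    ax.set_title(f"{d} dimensions")

plt.show()
```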
Note how as the dimensionality grows, classes become better and better separated, whereas in reality there is no difference between the classes.
We can see how PCA helps to prevent the overfitting if we make classes slightly separated. I added 1 to the first coordinate of the first class, 2 to the first coordinate of the second class, and 3 to the first coordinate of the third class. Now they are slightly separated, see top left subplot:
Overfitting (top row) is still obvious. But if I pre-process the data with PCA, always keeping 10 dimensions (bottom row), overfitting disappears while the classes remain near-optimally separated.
PS. To prevent misunderstandings: I am not claiming that PCA+LDA is a good regularization strategy (on the contrary, I would advise using rLDA); I am simply demonstrating that it is a possible strategy.
Update. A very similar topic has previously been discussed in the following threads, with interesting and comprehensive answers provided by @cbeleites:
- Should PCA be performed before I do classification?
- Does it make sense to run LDA on several principal components and not on all variables?
See also this question with some good answers:
Best Answer
Fingers crossed I can help. PCA, at its core, doesn't select the most "important" features. What it really does is apply a linear transformation to your data, moving it into a new coordinate system whose first component direction is the one with the largest variance (and likewise for the second, third, ...). The dimensionality reduction comes from choosing to keep only a subset of the components. So PCA doesn't "select" the most important features; rather, it finds linear combinations of the existing variables, and the user decides how many of those new combinations to keep.
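A minimal sketch of that distinction with scikit-learn (the data matrix `X` of shape `(n_samples, n_features)` is assumed to be your own data):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)              # how many components to keep is the user's choice
X_reduced = pca.fit_transform(X)       # data expressed in the new coordinate system

# Each kept component is a linear combination of *all* original features,
# not a selection of a few of them:
print(pca.components_)                 # shape (2, n_features): the combination weights
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
```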
If I can guess what you actually care about, I'd say that the main issue with PCA and its ability to "select the most important features" is that the components it produces can be very complicated linear combinations of the variables in your dataset. This often leads to a problem of component interpretability: because the components are such strange combinations of real variables, it becomes very difficult (if not impossible) to identify a real-world quantity with a given component. If your goal is to reduce the dimensionality of your dataset while still being able (hopefully) to say that your new, smaller set of variables is a collection of observable quantities, then I'd look into sparse PCA. Sparse PCA tries to accomplish the same goal as PCA but with the added constraint that each final linear combination should involve only a small subset of the original variables.
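As a hedged sketch of what that looks like in practice, scikit-learn's SparsePCA can be swapped in for PCA (again, `X` and the value of `alpha` are assumptions for illustration):

```python
from sklearn.decomposition import SparsePCA

spca = SparsePCA(n_components=2, alpha=1.0)   # alpha controls how sparse the loadings are
X_sparse = spca.fit_transform(X)

# Many loadings are (close to) zero, so each component involves only a small
# subset of the original variables and is easier to interpret:
print(spca.components_)
```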