How exactly is sparse PCA better than PCA?

Tags: machine-learning, pca, sparse

I learnt about PCA a few lectures ago in class, and by digging deeper into this fascinating concept, I got to know about sparse PCA.

I wanted to ask: if I'm not wrong, this is what sparse PCA is.
In PCA, if you have $n$ data points with $p$ variables each, you can represent each data point in $p$-dimensional space before applying PCA. After applying PCA, you can again represent the data in a space of the same dimension, but now the first principal component captures the most variance, the second captures the second-most variance, and so on. So you can drop the last few principal components, since doing so loses little information, and thereby compress the data. Right?
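
To make the compression idea concrete, here is a minimal sketch using scikit-learn on made-up random data (the sizes 100, 10 and the choice of 3 kept components are arbitrary); it only shows how the explained variance is ordered and how the data can be truncated to the first few components:

```python
# Minimal illustration of PCA variance ordering and truncation (random stand-in data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # n = 100 data points, p = 10 variables (made up)

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)    # decreasing: the 1st component explains the most variance

# Keep only the first 3 components: a lossy but compact representation of the data
X_compressed = PCA(n_components=3).fit_transform(X)
print(X_compressed.shape)               # (100, 3)
```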

Sparse PCA selects principal components whose coefficient vectors contain fewer non-zero values.
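
If it helps to see that definition in code, here is a minimal sketch (again with scikit-learn and random stand-in data; the choice of 2 components and the `alpha` value are arbitrary) contrasting the dense loadings of ordinary PCA with the sparse loadings of `SparsePCA`:

```python
# Hypothetical illustration on random data: ordinary PCA loadings vs sparse PCA loadings.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 points, 10 variables (made up)

dense_axes = PCA(n_components=2).fit(X).components_
sparse_axes = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X).components_

print(np.count_nonzero(dense_axes))     # ordinary PCA: typically all 2 * 10 coefficients are non-zero
print(np.count_nonzero(sparse_axes))    # sparse PCA: many coefficients are exactly zero
```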

How is this supposed to help you interpret data better?
Can anyone give an example?

Best Answer

Whether sparse PCA is easier to interpret than standard PCA depends on the dataset you are investigating. Here is how I think about it: sometimes one is more interested in the PCA projections (the low-dimensional representation of the data), and sometimes in the principal axes themselves; it is only in the latter case that sparse PCA can have any benefit for interpretation. Let me give a couple of examples.

I work, for example, with neural data (simultaneous recordings of many neurons) and apply PCA and/or related dimensionality reduction techniques to get a low-dimensional representation of neural population activity. I might have 1000 neurons (i.e. my data live in 1000-dimensional space) and want to project them onto the three leading principal axes. What these axes are is totally irrelevant to me, and I have no intention of "interpreting" them in any way. What I am interested in is the 3D projection (as the activity depends on time, I get a trajectory in this 3D space). So I am fine with each axis having all 1000 coefficients non-zero.
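
As a sketch of what I mean (with random numbers standing in for a real recording; 1000 neurons and 500 time points are just made-up sizes):

```python
# First use case: only the low-dimensional projection matters, not the axes themselves.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
activity = rng.normal(size=(500, 1000))     # rows = time points, columns = neurons (stand-in data)

trajectory = PCA(n_components=3).fit_transform(activity)   # shape (500, 3)
# 'trajectory' is the population activity over time, projected onto the 3 leading axes;
# each axis mixes all 1000 neurons (all coefficients non-zero), which is fine for this purpose.
```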

On the other hand, somebody might be working with more "tangible" data, where individual dimensions have an obvious meaning (unlike the individual neurons above). For example, a dataset of various cars, where the dimensions are anything from weight to price. In this case one might actually be interested in the leading principal axes themselves, because one might want to say something like: look, the 1st principal axis corresponds to the "fanciness" of the car (I am totally making this up now). If the axis is sparse, such interpretations are generally easier to give, because many variables will have $0$ coefficients and so are obviously irrelevant for this particular axis. With standard PCA, one usually gets non-zero coefficients for all variables.
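
A toy sketch of this second situation (the feature names and the data are entirely invented; I only want to show how one would inspect the loadings):

```python
# Second use case: the axes themselves are the object of interest.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(1)
features = ["weight", "price", "horsepower", "top_speed", "trunk_volume", "fuel_economy"]
X = rng.normal(size=(200, len(features)))   # stand-in for a real car dataset

dense_axis = PCA(n_components=1).fit(X).components_[0]
sparse_axis = SparsePCA(n_components=1, alpha=1.0, random_state=0).fit(X).components_[0]

for name, d, s in zip(features, dense_axis, sparse_axis):
    print(f"{name:14s}  PCA: {d:+.2f}   sparse PCA: {s:+.2f}")
# With sparse PCA, several coefficients are typically exactly 0, so reading off which
# variables define the (hypothetical) "fanciness" axis is easier than with the dense axis.
```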

You can find more examples and some discussion of the latter case in the 2006 sparse PCA paper by Zou et al. However, I have not seen the difference between the former and the latter case explicitly discussed anywhere (even though it probably has been).
