Solved – Supervised dimensionality reduction

dimensionality reduction, discriminant analysis, machine learning, pca, supervised learning

I have a data set consisting of 15K labeled samples belonging to 10 groups. I want to reduce the data to 2 dimensions in a way that takes the known labels into account.

When I use "standard" unsupervised dimensionality reduction techniques such as PCA, the scatter plot seems to have nothing to do with the known labels.

Does what I am looking for have a name? I would like to read some references about possible solutions.

Best Answer

The most standard linear method of supervised dimensionality reduction is called linear discriminant analysis (LDA). It is designed to find the low-dimensional projection that maximizes class separation. You can find a lot of information about it under the discriminant-analysis tag, and in any machine learning textbook, e.g. the freely available The Elements of Statistical Learning.
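For a concrete starting point, here is a minimal sketch using scikit-learn; the synthetic make_classification data (15K samples, 10 classes) is a stand-in for your data, not something from the question:

```python
# Minimal LDA sketch: project labeled data onto 2 supervised dimensions.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical stand-in for the 15K samples in 10 groups
X, y = make_classification(n_samples=15000, n_features=50,
                           n_informative=10, n_classes=10,
                           n_clusters_per_class=1, random_state=0)

# LDA allows up to n_classes - 1 components; we ask for 2 for plotting
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=2, cmap="tab10")
plt.xlabel("LD1")
plt.ylabel("LD2")
plt.show()
```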

Here is a picture that I found with a quick Google search; it shows one-dimensional PCA and LDA projections when there are two classes in the dataset (origin added by me):

[Figure: PCA vs LDA, one-dimensional PCA and LDA projections of a two-class dataset]

Another approach is called partial least squares (PLS). LDA can be interpreted as looking for the projections that have the highest correlation with the dummy variables encoding the group labels (in this sense LDA can be seen as a special case of canonical correlation analysis, CCA). In contrast, PLS looks for the projections with the highest covariance with the group labels. Whereas LDA yields only one axis in the case of two groups (as in the picture above), PLS will find multiple axes, ordered by decreasing covariance. Note that when there are more than two groups in the dataset, there are different "flavours" of PLS that will produce somewhat different results.
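A minimal PLS sketch under the same assumptions (scikit-learn, synthetic stand-in data); PLSRegression is one particular flavour, and the one-hot (dummy) encoding of the labels is the step described above:

```python
# PLS sketch: find projections with highest covariance with the labels.
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_classification
from sklearn.preprocessing import LabelBinarizer

X, y = make_classification(n_samples=15000, n_features=50,
                           n_informative=10, n_classes=10,
                           n_clusters_per_class=1, random_state=0)

# Encode the 10 group labels as a 15000 x 10 dummy matrix
Y = LabelBinarizer().fit_transform(y)

# PLSRegression is one flavour; other variants give somewhat different axes
pls = PLSRegression(n_components=2)
pls.fit(X, Y)
X_2d = pls.transform(X)  # supervised 2D scores
```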

Update (2018)

I should find time to expand this answer; this thread seems to be popular but my original answer above is very short and not detailed enough.

In the meantime, I will mention Neighbourhood Components Analysis (NCA) -- a linear method that finds the projection maximizing $k$-nearest-neighbour classification accuracy. There is a nonlinear generalization using neural networks; see Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. One can also use neural network classifiers with a bottleneck; see Deep Bottleneck Classifiers in Supervised Dimension Reduction.
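A minimal NCA sketch, assuming scikit-learn (its NeighborhoodComponentsAnalysis estimator implements the linear method; the nonlinear neural-network variants above would need a separate implementation). The synthetic data and the reduced sample size are assumptions for illustration:

```python
# NCA sketch: learn a linear projection that maximizes a soft version of
# k-nearest-neighbour classification accuracy in the projected space.
from sklearn.datasets import make_classification
from sklearn.neighbors import NeighborhoodComponentsAnalysis

# Smaller stand-in dataset: NCA's objective involves pairwise distances,
# so its cost grows quadratically with the number of samples.
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, n_classes=10,
                           n_clusters_per_class=1, random_state=0)

nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
X_2d = nca.fit_transform(X, y)  # 2D embedding informed by the labels
```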
