Discriminant Analysis in Machine Learning – Is It Supervised Learning?

Tags: discriminant analysis, machine learning, terminology

Is linear discriminant analysis, specifically Linear Programming Discriminant Analysis (LPDA), supervised learning? If possible, can you provide a valid reference that states so?

My study supervisor and I have been disagreeing about it. I'm convinced linear discriminant analysis, whether Fisher LDA or LPDA, is supervised learning. Both techniques use a labelled set of objects to derive a function which can be used to predict class labels for unlabelled objects.

My study supervisor does not agree, stating that nothing is "learned" when using discriminant analysis.

Best Answer

As you say, LDA is supervised. How does your supervisor define "learning"?

But yes, it is usually counted as supervised learning. For a reference, see e.g. the first two pages of The Elements of Statistical Learning.

You can use LDA models for prediction of new cases. (I'd say that implies that something has been learned.)
However, you can also put emphasis on the projection aspect, which may be used in a descriptive rather than a predictive way.
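To make the "something has been learned" point concrete, here is a minimal NumPy sketch of LDA as a supervised procedure: parameters are estimated from labelled objects and then used to predict labels for new ones. The function names `fit_lda` and `predict_lda` and the toy data are my own illustration, not taken from any reference.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, pooled covariance and priors from labelled data."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Pooled within-class covariance: centre each point on its own class mean
    centered = X - means[np.searchsorted(classes, y)]
    pooled_cov = centered.T @ centered / (len(X) - len(classes))
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, pooled_cov, priors

def predict_lda(model, X_new):
    """Assign each row of X_new to the class with the largest linear score."""
    classes, means, pooled_cov, priors = model
    coef = np.linalg.solve(pooled_cov, means.T)   # shape (features, classes)
    const = -0.5 * np.sum(means.T * coef, axis=0) + np.log(priors)
    return classes[np.argmax(X_new @ coef + const, axis=1)]

# Toy labelled data: two well-separated Gaussian classes (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = fit_lda(X, y)                                   # the "learning" step
pred = predict_lda(model, np.array([[0.0, 0.0], [4.0, 4.0]]))
print(pred)
```

Everything in `model` is estimated from the labels, which is exactly what makes the method supervised.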

I think we wrote something here.

Related Solutions

Classification – Combining PCA and LDA: Does It Make Sense?

Summary: PCA can be performed before LDA to regularize the problem and avoid over-fitting.

Recall that LDA projections are computed via eigendecomposition of $\boldsymbol \Sigma_W^{-1} \boldsymbol \Sigma_B$, where $\boldsymbol \Sigma_W$ and $\boldsymbol \Sigma_B$ are the within- and between-class covariance matrices. If there are fewer than $N$ data points (where $N$ is the dimensionality of your space, i.e. the number of features/variables), then $\boldsymbol \Sigma_W$ will be singular and therefore cannot be inverted. In this case there is simply no way to perform LDA directly, but if one applies PCA first, it will work. @Aaron made this remark in the comments to his reply, and I agree with that (but disagree with his answer in general, as you will see now).
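As a rough sketch of both points (the singularity and the eigendecomposition), the NumPy snippet below builds the within- and between-class scatter matrices, which differ from the covariance matrices only by a scaling that does not change the eigenvectors. The helper name `scatter_matrices` and the random data are assumptions made for illustration.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return the within-class and between-class scatter matrices."""
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_w = np.zeros((n_features, n_features))
    S_b = np.zeros((n_features, n_features))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - overall_mean)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    return S_w, S_b

rng = np.random.default_rng(0)

# Fewer points (15) than features (20): S_w cannot have full rank
y_small = np.repeat([0, 1, 2], 5)
X_small = rng.normal(size=(15, 20))
S_w_small, _ = scatter_matrices(X_small, y_small)
rank_w = np.linalg.matrix_rank(S_w_small)
print("rank of S_w:", rank_w, "out of", S_w_small.shape[0])

# Plenty of points (150) for 5 features: the eigendecomposition goes through
y_big = np.repeat([0, 1, 2], 50)
X_big = rng.normal(size=(150, 5))
S_w, S_b = scatter_matrices(X_big, y_big)
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real      # 3 classes give at most 2 discriminant axes
projected = X_big @ W
```

With 15 points in 3 classes the rank of the within-class scatter is at most 15 - 3 = 12, so a 20 x 20 matrix built from them is necessarily singular.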

However, this is only part of the problem. The bigger picture is that LDA tends to overfit the data very easily. Note that the within-class covariance matrix gets inverted in the LDA computations; for high-dimensional matrices, inversion is a very sensitive operation that can only be done reliably if the estimate of $\boldsymbol \Sigma_W$ is really good. But in high dimensions, $N \gg 1$, it is difficult to obtain a precise estimate of $\boldsymbol \Sigma_W$, and in practice one often needs a lot more than $N$ data points to start hoping that the estimate is good. Otherwise $\boldsymbol \Sigma_W$ will be almost singular (i.e. some of its eigenvalues will be very small), and this will cause over-fitting, i.e. near-perfect class separation on the training data with chance performance on the test data.

To tackle this issue, one needs to regularize the problem. One way to do it is to use PCA to reduce dimensionality first. There are other, arguably better, ways, e.g. the regularized LDA (rLDA) method, which simply uses $(1-\lambda)\boldsymbol \Sigma_W + \lambda \boldsymbol I$ with a small $\lambda$ instead of $\boldsymbol \Sigma_W$ (this is called a shrinkage estimator), but doing PCA first is conceptually the simplest approach and often works just fine.
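A minimal sketch of the shrinkage formula, assuming NumPy; the function name `shrink_covariance`, the value $\lambda = 0.1$, and the single-class random data standing in for $\boldsymbol \Sigma_W$ are illustrative choices only.

```python
import numpy as np

def shrink_covariance(sigma_w, lam):
    """Return (1 - lam) * sigma_w + lam * I."""
    return (1.0 - lam) * sigma_w + lam * np.eye(sigma_w.shape[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))          # 30 points, 50 features
sigma_w = np.cov(X, rowvar=False)      # rank-deficient sample covariance
sigma_reg = shrink_covariance(sigma_w, lam=0.1)

# The smallest eigenvalue is (numerically) zero before shrinkage
# and is lifted to roughly lam afterwards, so the matrix becomes invertible
min_eig_raw = np.linalg.eigvalsh(sigma_w).min()
min_eig_reg = np.linalg.eigvalsh(sigma_reg).min()
print(min_eig_raw, min_eig_reg)
```

Since every eigenvalue $e$ of $\boldsymbol \Sigma_W$ is mapped to $(1-\lambda)e + \lambda$, none of them can fall below $\lambda$, which is what stabilizes the inversion.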

Illustration

Here is an illustration of the over-fitting problem. I generated 60 samples per class in 3 classes from a standard Gaussian distribution (mean zero, unit variance) in 10-, 50-, 100-, and 150-dimensional spaces, and applied LDA to project the data onto 2D:

[Figure: Overfitting in LDA]

Note how as the dimensionality grows, classes become better and better separated, whereas in reality there is no difference between the classes.
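The same effect can be checked numerically rather than visually. The sketch below is not the code behind the figure; it is a hypothetical NumPy re-creation that uses the same sample sizes and dimensionalities and compares training accuracy with accuracy on fresh data drawn from the same distribution.

```python
import numpy as np

def lda_fit_predict(X_train, y_train, X_test):
    """Fit LDA with a pooled covariance and return predicted labels for X_test."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    centered = X_train - means[np.searchsorted(classes, y_train)]
    pooled = centered.T @ centered / (len(X_train) - len(classes))
    coef = np.linalg.solve(pooled, means.T)
    # Classes are balanced, so the log-prior term is the same for all and is dropped
    const = -0.5 * np.sum(means.T * coef, axis=0)
    return classes[np.argmax(X_test @ coef + const, axis=1)]

rng = np.random.default_rng(0)
y_train = np.repeat([0, 1, 2], 60)     # 60 samples per class, labels carry no signal
y_test = np.repeat([0, 1, 2], 200)
results = {}
for dim in (10, 50, 100, 150):
    X_train = rng.normal(size=(180, dim))
    X_test = rng.normal(size=(600, dim))
    train_acc = np.mean(lda_fit_predict(X_train, y_train, X_train) == y_train)
    test_acc = np.mean(lda_fit_predict(X_train, y_train, X_test) == y_test)
    results[dim] = (train_acc, test_acc)
    print(f"dim={dim:3d}  train accuracy={train_acc:.2f}  test accuracy={test_acc:.2f}")
```

Because the labels are pure noise, any training accuracy well above one third reflects over-fitting, and the test accuracy should stay close to chance.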

We can see how PCA helps to prevent the overfitting if we make classes slightly separated. I added 1 to the first coordinate of the first class, 2 to the first coordinate of the second class, and 3 to the first coordinate of the third class. Now they are slightly separated, see top left subplot:

[Figure: Overfitting in LDA and regularization with PCA]

Overfitting (top row) is still obvious. But if I pre-process the data with PCA, always keeping 10 dimensions (bottom row), overfitting disappears while the classes remain near-optimally separated.
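For completeness, here is a sketch of the PCA pre-processing step only, assuming NumPy and an SVD-based PCA; the helper `pca_reduce` and the random data are hypothetical, and how much the reduction helps will depend on the data and on how many components are kept.

```python
import numpy as np

def pca_reduce(X_train, X_other, n_components):
    """Project both arrays onto the top principal axes of X_train."""
    mean = X_train.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    axes = vt[:n_components].T
    return (X_train - mean) @ axes, (X_other - mean) @ axes

rng = np.random.default_rng(0)
dim = 150
labels = np.repeat([0, 1, 2], 60)
X_train = rng.normal(size=(180, dim))
X_test = rng.normal(size=(180, dim))
# Shift the first coordinate by 1, 2 and 3 for the three classes
X_train[:, 0] += labels + 1
X_test[:, 0] += labels + 1

Z_train, Z_test = pca_reduce(X_train, X_test, n_components=10)
# Z_train and Z_test can now be passed to any LDA routine in place of the raw data
rank_reduced = np.linalg.matrix_rank(np.cov(Z_train, rowvar=False))
print(Z_train.shape, Z_test.shape, rank_reduced)
```

The principal axes are estimated from the training data alone and then applied to the test data, so that no information from the test set leaks into the model.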

PS. To prevent misunderstandings: I am not claiming that PCA+LDA is a good regularization strategy (on the contrary, I would advise using rLDA); I am simply demonstrating that it is a possible strategy.

Update. A very similar topic was previously discussed in the following threads, with interesting and comprehensive answers provided by @cbeleites:

See also this question with some good answers:

What can cause PCA to worsen results of a classifier?

Solved – What learning occurs in linear discriminant analysis

The paper you're reading is describing Fisher's linear discriminant and the MATLAB code is actually implementing LDA that assumes a multivariate normal distribution.

Take a look at this link for a more thorough description. The main part that is confusing you ($\vec{G}$) is calculated here:

Temp = GroupMean(i,:) / PooledCov;

% Constant
W(i,1) = -0.5 * Temp * GroupMean(i,:)' + log(PriorProb(i));

and corresponds to the fairly standard maximum likelihood estimation of the multivariate normal (page 7 in the slides).
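Written out as a formula, and assuming that `Temp` holds $\vec{\mu}_i^{\top} \boldsymbol \Sigma^{-1}$, with $\vec{\mu}_i$ the mean of group $i$, $\boldsymbol \Sigma$ the pooled covariance and $\pi_i$ the prior probability (these symbols are my own notation, not necessarily the paper's), the constant computed above and the resulting linear score would be:

$$
c_i = -\tfrac{1}{2}\,\vec{\mu}_i^{\top} \boldsymbol \Sigma^{-1} \vec{\mu}_i + \log \pi_i,
\qquad
\delta_i(\vec{x}) = \vec{\mu}_i^{\top} \boldsymbol \Sigma^{-1} \vec{x} + c_i
$$

A new observation $\vec{x}$ is then assigned to the group with the largest $\delta_i(\vec{x})$.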

Just to be clear: Fisher's linear discriminant and LDA are equivalent (assuming LDA's assumptions are satisfied) in that both will give you the same projection.

UPDATE: Actually, Wikipedia offers an overview of both approaches.
