Solved – What are “coefficients of linear discriminants” in LDA


In R, I use the lda function from the MASS library to do classification. As I understand LDA, an input $x$ is assigned the label $y$ that maximizes $p(y|x)$, right?

But when I fit the model, with $$x=(\text{Lag1},\text{Lag2}),\qquad y=\text{Direction},$$ I don't quite understand the output from lda.

Edit: to reproduce the output below, first run:

library(MASS)
library(ISLR)

train = subset(Smarket, Year < 2005)

lda.fit = lda(Direction ~ Lag1 + Lag2, data = train)
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293

I understand all the information in the above output except one thing: what is LD1? I searched the web for it; is it the linear discriminant score? If so, what is that and why do I need it?

UPDATE

I read several posts (such as this and this one) and also searched the web for DA, and here is what I now think about DA or LDA.

  1. It can be used for classification, and when that is the purpose, I can use the Bayes approach: compute the posterior $p(y|x)$ for each class $y_i$, then classify $x$ to the class with the highest posterior. With this approach, I don't need to find the discriminants at all, right?

  2. As I read in the posts, DA, or at least LDA, is primarily aimed at dimensionality reduction: for $K$ classes and a $D$-dimensional predictor space, I can project the $D$-dimensional $x$ into a new $(K-1)$-dimensional feature space $z$, that is, \begin{align*}x&=(x_1,\ldots,x_D)\\z&=(z_1,\ldots,z_{K-1})\\z_i&=w_i^Tx,\end{align*} so $z$ can be seen as the transformed feature vector from the original $x$, and each $w_i$ is the vector onto which $x$ is projected (see the quick check below).
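For example, here is a quick check of this using the built-in iris data (my own illustration, not the data set from my question), where $K = 3$ and $D = 4$:

library(MASS)

fit <- lda(Species ~ ., data = iris)  # K = 3 classes, D = 4 predictors
dim(fit$scaling)                      # 4 x 2: the columns LD1, LD2 are the w_i
z <- predict(fit)$x                   # n x 2 matrix of projected features z
head(z)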

Am I right about the above statements? If yes, I have the following questions:

  1. What is a discriminant? Is each entry $z_i$ in the vector $z$ a discriminant? Or is it $w_i$?

  2. How to do classification using discriminants?

Best Answer

If you multiply each value of LD1 (the first linear discriminant) by the corresponding element of the predictor variables and sum them ($-0.6420190 \times \text{Lag1} - 0.5135293 \times \text{Lag2}$), you get a score for each respondent. This score, along with the prior, is used to compute the posterior probability of class membership (there are a number of different formulas for this). Classification is made based on the posterior probability, with observations predicted to be in the class for which they have the highest probability.
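Here is a sketch of that computation, reproducing the setup from the question (note that predict.lda centres each predictor at the prior-weighted overall mean before applying the coefficients):

library(MASS)
library(ISLR)

train <- subset(Smarket, Year < 2005)
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = train)

# Manual score: centre each predictor at the prior-weighted overall mean,
# then take the linear combination given by the LD1 coefficients
X <- as.matrix(train[, c("Lag1", "Lag2")])
centre <- colSums(lda.fit$prior * lda.fit$means)
manual.score <- scale(X, center = centre, scale = FALSE) %*% lda.fit$scaling

# predict() returns the same scores in its $x component
all.equal(unname(manual.score), unname(predict(lda.fit)$x))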

The chart below illustrates the relationship between the score, the posterior probability, and the classification for the data set used in the question. The basic pattern always holds with two-group LDA: there is a 1-to-1 mapping between the scores and the posterior probabilities, and predictions are equivalent whether made from the posterior probabilities or from the scores.

[Figure: Score, Posterior Probability, Classification]
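A chart along these lines can be reproduced from the fitted model (a sketch, assuming the lda.fit object fitted in the question):

pred <- predict(lda.fit)

# 1-to-1 mapping between the LD1 score and the posterior probability,
# coloured by the predicted classification
plot(pred$x[, "LD1"], pred$posterior[, "Up"],
     xlab = "Score (LD1)", ylab = "Posterior probability of Up",
     col = ifelse(pred$class == "Up", "blue", "red"))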

Answers to the sub-questions and some other comments

  • Although LDA can be used for dimension reduction, this is not what is going on in the example. With two groups, only a single score is required per observation, because the probability of being in one group is the complement of the probability of being in the other (i.e., they add to 1). You can see this in the chart: scores of less than about -0.4 are classified as Down, and higher scores are predicted to be Up.
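You can find the cutoff empirically (a sketch, again assuming lda.fit from the question):

pred <- predict(lda.fit)

# Predicted Down observations all have scores below the cutoff;
# predicted Up observations all have scores above it
max(pred$x[pred$class == "Down", "LD1"])
min(pred$x[pred$class == "Up", "LD1"])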

  • Sometimes the vector of scores is called a discriminant function, and sometimes the coefficients are; I'm not clear on which usage is correct. I believe that MASS uses "discriminant" to refer to the coefficients.

  • The MASS package's lda function produces coefficients in a different way from most other LDA software. The alternative approach computes one set of coefficients for each group, and each set of coefficients has an intercept. With the discriminant functions (scores) computed using these coefficients, classification is based on the highest score, and there is no need to compute posterior probabilities in order to predict the classification. I have put some LDA code on GitHub which is a modification of the MASS function but produces these more convenient coefficients (the package is called Displayr/flipMultivariates, and if you create an object using LDA you can extract the coefficients using obj$original$discriminant.functions).
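As a rough sketch of that alternative formulation (the classical per-group classification functions, computed here from first principles on the train data from the question rather than via the package above):

X <- as.matrix(train[, c("Lag1", "Lag2")])
g <- train$Direction
n <- table(g)

# Pooled within-group covariance matrix
S <- ((n["Down"] - 1) * cov(X[g == "Down", ]) +
      (n["Up"]   - 1) * cov(X[g == "Up",   ])) / (nrow(X) - 2)
Sinv <- solve(S)

# One set of coefficients (plus an intercept) per group; classification
# assigns each observation to the group with the highest resulting score
for (grp in levels(g)) {
  mu <- colMeans(X[g == grp, ])
  coefs <- Sinv %*% mu
  intercept <- -0.5 * drop(t(mu) %*% Sinv %*% mu) + log(mean(g == grp))
  cat(grp, ": intercept =", round(intercept, 4),
      " coefficients =", round(drop(coefs), 4), "\n")
}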

  • I have posted the R code for all the concepts in this post here.

  • There is no single formula for computing posterior probabilities from the score. The easiest way to understand the options is (for me anyway) to look at the source code, using:

library(MASS)
getAnywhere("predict.lda")
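For example, as far as I can tell, the default ("plug-in") method is equivalent to treating the scores as having unit within-group variance, so that posterior$_g \propto \text{prior}_g \times \exp\!\big(-\tfrac{1}{2}(\text{score} - \text{group mean score})^2\big)$. A sketch, assuming lda.fit from the question:

# Group means in score space
centre <- colSums(lda.fit$prior * lda.fit$means)
zbar <- scale(lda.fit$means, center = centre, scale = FALSE) %*% lda.fit$scaling

# Plug-in posterior: prior * exp(-0.5 * (score - group mean score)^2),
# normalised to sum to 1 across groups
z <- predict(lda.fit)$x[, "LD1"]
un <- sapply(rownames(zbar), function(g)
  lda.fit$prior[g] * exp(-0.5 * (z - zbar[g, 1])^2))
post <- un / rowSums(un)

all.equal(unname(post), unname(predict(lda.fit)$posterior))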