In R, I use the `lda` function from the `MASS` package to do classification. As I understand LDA, an input $x$ is assigned the label $y$ that maximizes $p(y|x)$, right?

But when I fit the model, in which $$x=(\text{Lag1},\text{Lag2}),\qquad y=\text{Direction},$$ I don't quite understand the output from `lda`.
Edit: to reproduce the output below, first run:

```r
library(MASS)
library(ISLR)
train = subset(Smarket, Year < 2005)
lda.fit = lda(Direction ~ Lag1 + Lag2, data = train)
```
```r
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293
```
I understand all the info in the above output except one thing: what is `LD1`? I searched the web for it; is it the linear discriminant score? What is that, and why do I need it?
UPDATE
I read several posts (such as this and this one) and also searched the web for DA, and here is what I now think about DA, or LDA:
- It can be used to do classification, and when this is the purpose, I can use the Bayes approach, that is, compute the posterior $p(y|x)$ for each class $y_i$, and then classify $x$ into the class with the highest posterior. With this approach, I don't need to find the discriminants at all, right?
- As I read in the posts, DA, or at least LDA, is primarily aimed at dimensionality reduction: for $K$ classes and a $D$-dimensional predictor space, I can project the $D$-dimensional $x$ into a new $(K-1)$-dimensional feature space $z$, that is, \begin{align*}x&=(x_1,\dots,x_D)\\z&=(z_1,\dots,z_{K-1})\\z_i&=w_i^Tx\end{align*} Here $z$ can be seen as the feature vector transformed from the original $x$, and each $w_i$ is a vector onto which $x$ is projected.
Am I right about the above statements? If so, I have the following questions:
- What is a discriminant? Is each entry $z_i$ in the vector $z$ a discriminant? Or is it $w_i$?
- How do I do classification using discriminants?
Best Answer
If you multiply each value of `LD1` (the first linear discriminant) by the corresponding elements of the predictor variables and sum them ($-0.6420190\times \text{Lag1} + (-0.5135293)\times \text{Lag2}$), you get a score for each respondent. This score, along with the prior, is used to compute the posterior probability of class membership (there are a number of different formulas for this). Classification is made based on the posterior probability, with observations predicted to be in the class for which they have the highest probability.

The chart below illustrates the relationship between the score, the posterior probability, and the classification, for the data set used in the question. The basic pattern always holds with two-group LDA: there is a 1-to-1 mapping between the scores and the posterior probabilities, and predictions are equivalent whether made from the posterior probabilities or from the scores.
Answers to the sub-questions and some other comments
Although LDA can be used for dimension reduction, this is not what is going on in the example. With two groups, only a single score is required per observation because that is all that is needed: the probability of being in one group is the complement of the probability of being in the other (i.e., they sum to 1). You can see this in the chart: scores of less than about -0.4 are classified as being in the Down group, and higher scores are predicted to be Up.
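Both facts can be checked with a sketch on simulated two-group data (hypothetical, not Smarket): the two posterior columns sum to 1, the posterior is a monotone function of the score, and thresholding the posterior at 0.5 reproduces the predicted classes.

```r
library(MASS)
set.seed(1)

# hypothetical two-group data
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(n) > 0, "Up", "Down"))
fit <- lda(y ~ x1 + x2, data = d)

pr   <- predict(fit)
post <- pr$posterior
range(rowSums(post))  # the two posteriors are complements: every row sums to 1

# 1-to-1 mapping: sorted by score, the posterior is monotone
p <- post[order(pr$x[, 1]), "Up"]
all(diff(p) >= 0) || all(diff(p) <= 0)  # TRUE

# classifying by posterior is the same as cutting the score at one threshold
all(as.character(pr$class) == ifelse(post[, "Up"] > 0.5, "Up", "Down"))  # TRUE
```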
Sometimes the vector of scores is called a *discriminant function*. Sometimes the coefficients are called this. I'm not clear on whether either is correct. I believe that in MASS, `discriminant` refers to the coefficients.

The MASS package's `lda` function produces coefficients in a different way to most other LDA software. The alternative approach computes one set of coefficients for each group, and each set of coefficients has an intercept. With the discriminant functions (scores) computed using these coefficients, classification is based on the highest score, and there is no need to compute posterior probabilities in order to predict the classification. I have put some LDA code on GitHub which is a modification of the `MASS` function but produces these more convenient coefficients (the package is called `Displayr/flipMultivariates`, and if you create an object using `LDA` you can extract the coefficients using `obj$original$discriminant.functions`).

I have posted the R code for all the concepts in this post here.
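As a sketch of that alternative parameterisation (the standard textbook construction of per-group classification functions, not the actual `flipMultivariates` code), on simulated data: one coefficient vector and one intercept per group, with classification by the largest score.

```r
library(MASS)
set.seed(1)

# hypothetical two-group data (the same construction works for any K)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(n) > 0, "Up", "Down"))
fit <- lda(y ~ x1 + x2, data = d)
X   <- as.matrix(d[, c("x1", "x2")])

# per-group classification functions: for class k,
#   coef_k      = S^{-1} mu_k
#   intercept_k = -0.5 * mu_k' S^{-1} mu_k + log(prior_k)
# where S is the pooled within-group covariance (divisor n - K)
K <- nlevels(d$y)
S <- Reduce(`+`, lapply(levels(d$y), function(k) {
       Xk <- X[d$y == k, , drop = FALSE]
       (nrow(Xk) - 1) * cov(Xk)
     })) / (nrow(X) - K)
coefs <- fit$means %*% solve(S)                  # one row of coefficients per class
ints  <- -0.5 * rowSums(coefs * fit$means) + log(fit$prior)

# classify by the highest per-group score; no posteriors needed
scores <- sweep(X %*% t(coefs), 2, ints, "+")
pred   <- levels(d$y)[max.col(scores)]
all(pred == as.character(predict(fit)$class))
```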
```r
library(MASS)
getAnywhere("predict.lda")
```