Principal components analysis and linear discriminant analysis outputs; iris data.
I will not be drawing biplots, because biplots can be drawn with various normalizations and therefore may look different. Since I'm not an R
user, I have difficulty tracking down how you produced your plots in order to repeat them. Instead, I will do PCA and LDA and show the results, in a manner similar to this (you might want to read it). Both analyses were done in SPSS.
Principal components of iris data:
The analysis will be based on covariances (not correlations) between the 4 variables.
Eigenvalues (component variances) and the proportion of overall variance explained
PC1 4.228241706 .924618723
PC2 .242670748 .053066483
PC3 .078209500 .017102610
PC4 .023835093 .005212184
# @Etienne's comment:
# Eigenvalues are obtained in R by
# (princomp(iris[,-5])$sdev)^2 or (prcomp(iris[,-5])$sdev)^2.
# Proportion of variance explained is obtained in R by
# summary(princomp(iris[,-5])) or summary(prcomp(iris[,-5]))
Eigenvectors (cosines of rotation of variables into components)
PC1 PC2 PC3 PC4
SLength .3613865918 .6565887713 -.5820298513 .3154871929
SWidth -.0845225141 .7301614348 .5979108301 -.3197231037
PLength .8566706060 -.1733726628 .0762360758 -.4798389870
PWidth .3582891972 -.0754810199 .5458314320 .7536574253
# @Etienne's comment:
# This is obtained in R by
# prcomp(iris[,-5])$rotation or princomp(iris[,-5])$loadings
Loadings (eigenvectors normalized to respective eigenvalues;
loadings are the covariances between variables and standardized components)
PC1 PC2 PC3 PC4
SLength .743108002 .323446284 -.162770244 .048706863
SWidth -.173801015 .359689372 .167211512 -.049360829
PLength 1.761545107 -.085406187 .021320152 -.074080509
PWidth .736738926 -.037183175 .152647008 .116354292
# @Etienne's comment:
# Loadings can be obtained in R with
# t(t(princomp(iris[,-5])$loadings) * princomp(iris[,-5])$sdev) or
# t(t(prcomp(iris[,-5])$rotation) * prcomp(iris[,-5])$sdev)
Standardized (rescaled) loadings
(loadings divided by st. deviations of the respective variables)
PC1 PC2 PC3 PC4
SLength .897401762 .390604412 -.196566721 .058820016
SWidth -.398748472 .825228709 .383630296 -.113247642
PLength .997873942 -.048380599 .012077365 -.041964868
PWidth .966547516 -.048781602 .200261695 .152648309
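These rescaled loadings are not returned directly by prcomp; a small sketch of how they can be reproduced in R (my own code, dividing the loadings by the variables' standard deviations):
p <- prcomp(iris[,-5])
loadings <- sweep(p$rotation, 2, p$sdev, "*")                          # loadings = eigenvectors * component sd
loadings.rescaled <- sweep(loadings, 1, apply(iris[,-5], 2, sd), "/")  # divide by each variable's sd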
Raw component scores (Centered 4-variable data multiplied by eigenvectors)
PC1 PC2 PC3 PC4
-2.684125626 .319397247 -.027914828 .002262437
-2.714141687 -.177001225 -.210464272 .099026550
-2.888990569 -.144949426 .017900256 .019968390
-2.745342856 -.318298979 .031559374 -.075575817
-2.728716537 .326754513 .090079241 -.061258593
-2.280859633 .741330449 .168677658 -.024200858
-2.820537751 -.089461385 .257892158 -.048143106
-2.626144973 .163384960 -.021879318 -.045297871
-2.886382732 -.578311754 .020759570 -.026744736
-2.672755798 -.113774246 -.197632725 -.056295401
... etc.
# @Etienne's comment:
# This is obtained in R with
# prcomp(iris[,-5])$x or princomp(iris[,-5])$scores.
# Can also be eigenvector normalized for plotting
Standardized (to unit variances) component scores, when multiplied
by the loadings, return the original centered variables.
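To illustrate that statement, a quick sketch in R (my own check, using prcomp conventions; the reconstruction is exact up to floating-point error):
p <- prcomp(iris[,-5])
scores.std <- sweep(p$x, 2, p$sdev, "/")           # scores standardized to unit variance
loadings   <- sweep(p$rotation, 2, p$sdev, "*")    # loadings as defined above
centered   <- scale(iris[,-5], center=TRUE, scale=FALSE)
all.equal(scores.std %*% t(loadings), centered, check.attributes=FALSE)   # TRUE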
It is important to stress that it is loadings, not eigenvectors, by which we typically interpret principal components (or factors in factor analysis), if interpretation is needed. Loadings are the regression coefficients of modeling variables by standardized components. At the same time, because components don't intercorrelate, they are the covariances between such components and the variables. Standardized (rescaled) loadings, like correlations, cannot exceed 1 and are handier to interpret because the effect of unequal variances of the variables is removed.
It is loadings, not eigenvectors, that are typically displayed on a biplot side-by-side with component scores; the latter are often displayed column-normalized.
Linear discriminants of iris data:
There are 3 classes and 4 variables: min(3-1, 4) = 2 discriminants can be extracted.
Only the extraction (no classification of data points) will be done.
The Within scatter matrix
38.95620000 13.63000000 24.62460000 5.64500000
13.63000000 16.96200000 8.12080000 4.80840000
24.62460000 8.12080000 27.22260000 6.27180000
5.64500000 4.80840000 6.27180000 6.15660000
The Between scatter matrix
63.2121333 -19.9526667 165.2484000 71.2793333
-19.9526667 11.3449333 -57.2396000 -22.9326667
165.2484000 -57.2396000 437.1028000 186.7740000
71.2793333 -22.9326667 186.7740000 80.4133333
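These scatter matrices are not part of lda()'s output; a sketch of how they can be computed from the raw data (my own code):
X  <- as.matrix(iris[,-5])
Sw <- Reduce("+", lapply(split(as.data.frame(X), iris$Species), function(d) {
        Xc <- scale(as.matrix(d), center=TRUE, scale=FALSE)   # center within the group
        t(Xc) %*% Xc                                          # group scatter matrix
      }))
Xc <- scale(X, center=TRUE, scale=FALSE)
St <- t(Xc) %*% Xc     # total scatter
Sb <- St - Sw          # between-group scatter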
Eigenvalues and canonical correlations
(Canonical correlation squared is SSbetween/SStotal of ANOVA by that discriminant)
Dis1 32.19192920 .98482089
Dis2 .28539104 .47119702
# @Etienne's comment:
# In R eigenvalues are expected from
# lda(as.factor(Species)~.,data=iris)$svd, but this produces
# Dis1 Dis2
# 48.642644 4.579983
# @ttnphns' comment:
# The difference might be due to a different computational approach
# (e.g. I used eigendecomposition while R used SVD?) and is of no importance.
# The canonical correlations, though, should be the same.
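Continuing the sketch above (with Sw and Sb), the eigenvalues and canonical correlations can be reproduced via an eigendecomposition; since lambda/(1+lambda) is SSbetween/SStotal, the canonical correlation is sqrt(lambda/(1+lambda)):
ev      <- eigen(solve(Sw) %*% Sb)
lambda  <- Re(ev$values)[1:2]             # 32.1919  0.2854
cancorr <- sqrt(lambda / (1 + lambda))    # 0.9848   0.4712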
Eigenvectors
Dis1 Dis2
SLength -.0684059150 .0019879117
SWidth -.1265612055 .1785267025
PLength .1815528774 -.0768635659
PWidth .2318028594 .2341722673
Eigenvectors (as before, but column-normalized to SS=1: cosines of rotation of variables into discriminants).
Dis1 Dis2
SLength -.2087418215 .0065319640
SWidth -.3862036868 .5866105531
PLength .5540117156 -.2525615400
PWidth .7073503964 .7694530921
Unstandardized discriminant coefficients (proportionally related to eigenvectors)
Dis1 Dis2
SLength -.829377642 .024102149
SWidth -1.534473068 2.164521235
PLength 2.201211656 -.931921210
PWidth 2.810460309 2.839187853
# @Etienne's comment:
# This is obtained in R with
# lda(as.factor(Species)~.,data=iris)$scaling
# which is described as being standardized discriminant coefficients in the function definition.
Standardized discriminant coefficients
Dis1 Dis2
SLength -.4269548486 .0124075316
SWidth -.5212416758 .7352613085
PLength .9472572487 -.4010378190
PWidth .5751607719 .5810398645
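A sketch of how these can be obtained in R (my assumption: the standardized coefficients are the unstandardized ones multiplied by the pooled within-group standard deviations of the variables; signs may be flipped relative to the table):
library(MASS)
n <- nrow(iris); k <- nlevels(iris$Species)
sd.within  <- sqrt(diag(Sw) / (n - k))                       # pooled within-group sd (Sw from the sketch above)
coef.unstd <- lda(as.factor(Species)~., data=iris)$scaling
coef.std   <- coef.unstd * sd.within                         # multiply each row by the variable's pooled within-group sd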
Pooled within-groups correlations between variables and discriminants
Dis1 Dis2
SLength .2225959415 .3108117231
SWidth -.1190115149 .8636809224
PLength .7060653811 .1677013843
PWidth .6331779262 .7372420588
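A sketch of how this matrix can be computed (my assumption: it is the pooled within-group correlation matrix of the variables multiplied by the standardized coefficients from the sketch above; again, signs may be flipped):
Rw <- cov2cor(Sw / (n - k))       # pooled within-group correlations
structure.mat <- Rw %*% coef.std  # correlations between variables and discriminants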
Discriminant scores (Centered 4-variable data multiplied by unstandardized coefficients)
Dis1 Dis2
-8.061799783 .300420621
-7.128687721 -.786660426
-7.489827971 -.265384488
-6.813200569 -.670631068
-8.132309326 .514462530
-7.701946744 1.461720967
-7.212617624 .355836209
-7.605293546 -.011633838
-6.560551593 -1.015163624
-7.343059893 -.947319209
... etc.
# @Etienne's comment:
# This is obtained in R with
# predict(lda(as.factor(Species)~.,data=iris), iris[,-5])$x
For the computations involved in extracting discriminants in LDA, please look here. We usually interpret discriminants by the discriminant coefficients or the standardized discriminant coefficients (the latter are handier because the effect of unequal variances of the variables is removed). This is like in PCA. But note: the coefficients here are the regression coefficients of modeling the discriminants by the variables, not vice versa as it was in PCA. Because the variables are not uncorrelated, the coefficients cannot be seen as covariances between variables and discriminants.
Yet we have another matrix which may serve as an alternative source for interpreting discriminants: the pooled within-group correlations between the discriminants and the variables. Because discriminants are uncorrelated, like PCs, this matrix is in a sense analogous to the standardized loadings of PCA.
In all, while in PCA we have a single matrix (the loadings) to help interpret the latents, in LDA we have two alternative matrices for that. If you need to plot (a biplot or whatever), you have to decide whether to plot coefficients or correlations.
And, of course, it hardly needs saying that in PCA of the iris data the components don't "know" that there are 3 classes; they can't be expected to discriminate the classes. Discriminants do "know" there are classes, and discriminating them is their natural job.
If you multiply each element of LDA1 (the first linear discriminant) by the corresponding value of the predictor variables and sum them ($-0.6420190 \times \text{Lag1} - 0.5135293 \times \text{Lag2}$), you get a score for each respondent. This score, along with the prior, is used to compute the posterior probability of class membership (there are a number of different formulas for this). Classification is made based on the posterior probability, with observations predicted to be in the class for which they have the highest probability.
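A minimal sketch of this in R, assuming (as in the question) an lda() fit of Direction on Lag1 and Lag2 from the ISLR Smarket data (the exact subset used for fitting in the question may differ):
library(MASS)
library(ISLR)     # assumption: Smarket comes from the ISLR package
fit  <- lda(Direction ~ Lag1 + Lag2, data=Smarket)
pred <- predict(fit)
head(pred$x)            # discriminant scores
head(pred$posterior)    # posterior probabilities of class membership
head(pred$class)        # predicted class = class with the highest posterior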
The chart below illustrates the relationship between the score, the posterior probability, and the classification for the data set used in the question. The basic pattern always holds with two-group LDA: there is a 1-to-1 mapping between the scores and the posterior probabilities, and predictions are equivalent whether made from the posterior probabilities or from the scores.
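Continuing the sketch above, a plot along these lines can be produced with:
plot(pred$x[,1], pred$posterior[,"Up"],
     col=ifelse(pred$class == "Up", "blue", "red"),
     xlab="Discriminant score", ylab="Posterior probability of Up")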
Answers to the sub-questions and some other comments
Although LDA can be used for dimension reduction, this is not what is going on in the example. With two groups, only a single score is required per observation because the probability of being in one group is the complement of the probability of being in the other (i.e., they add to 1). You can see this in the chart: scores of less than -.4 are classified as being in the Down group, and higher scores are predicted to be Up.
Sometimes the vector of scores is called a discriminant function. Sometimes the coefficients are called this. I'm not clear on whether either is correct. I believe that the MASS discriminant refers to the coefficients.
The MASS package's lda function produces coefficients in a different way from most other LDA software. The alternative approach computes one set of coefficients for each group, and each set of coefficients has an intercept. With the discriminant function (scores) computed using these coefficients, classification is based on the highest score, and there is no need to compute posterior probabilities in order to predict the classification. I have put some LDA code on GitHub which is a modification of the MASS function but produces these more convenient coefficients (the package is called Displayr/flipMultivariates, and if you create an object using LDA you can extract the coefficients using obj$original$discriminant.functions).
I have posted the R code for all the concepts in this post here.
- There is no single formula for computing posterior probabilities from the score. The easiest way to understand the options is (for me anyway) to look at the source code, using:
library(MASS)
getAnywhere("predict.lda")
Best Answer
You can get the mathematical function describing the discriminant by plugging the mean vectors per class and the sample covariance matrix into Fisher's discriminant function equation (http://en.m.wikipedia.org/wiki/Linear_discriminant_analysis). The function itself is the decision boundary. Data points are classified based on their location in feature space relative to this function.
However, it would be pretty difficult to draw (graph) a four-dimensional surface (4 features). The best you can do is fix one or more feature values and then plot how the others vary for those fixed values.
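A sketch of that computation, under my own assumptions (two classes, a pooled covariance matrix, equal priors), using the iris data as an example:
fisher.boundary <- function(X1, X2) {
  mu1 <- colMeans(X1); mu2 <- colMeans(X2)
  n1  <- nrow(X1);     n2  <- nrow(X2)
  Sigma <- ((n1 - 1)*cov(X1) + (n2 - 1)*cov(X2)) / (n1 + n2 - 2)   # pooled covariance
  w <- solve(Sigma, mu1 - mu2)                                     # discriminant direction
  list(w = w, threshold = sum(w * (mu1 + mu2)) / 2)                # boundary: w'x = threshold
}
b <- fisher.boundary(as.matrix(iris[iris$Species == "versicolor", 1:4]),
                     as.matrix(iris[iris$Species == "virginica",  1:4]))
# classify x to the first class when sum(b$w * x) > b$threshold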