Solved – How to use LDA results for feature selection

discriminant analysis, feature selection, interpretation, r

I am working on the Forest type mapping dataset which is available in the UCI machine learning repository.

I have 27 features to predict the 4 types of forest. I am performing a Linear Discriminant Analysis (LDA) to reduce the number of features, using the lda() function from the MASS package.

The results of my LDA are as follows:

forest.lda <- lda(class ~.,data=forest)
forest.lda
Call:
lda(class ~ ., data = forest)

Prior probabilities of groups:
   d         h         o         s  
0.2727273 0.2424242 0.1868687 0.2979798 

Group means:
     b1       b2       b3        b4       b5        b6       b7       b8       b9
  d  53.31481 48.31481 68.40741  97.66667 63.18519 103.31481 98.74074  26.07407 56.46296
  h  78.27083 29.39583 55.14583 113.66667 50.22917  95.25000 98.00000 25.10417 60.06250
  o  65.08108 66.54054 87.97297 103.62162 77.70270 118.10811 91.56757 44.18919 77.94595
  s  57.96610 27.79661 51.05085  93.47458 49.67797  91.66102 76.52542 24.28814 55.67797
     pred_minus_obs_H_b1 pred_minus_obs_H_b2 pred_minus_obs_H_b3 pred_minus_obs_H_b4
  d             60.76204            2.723704            28.25963           0.7018519
  h             35.17708           21.101250            40.25167         -15.9608333
  o             48.90054          -15.450270             8.99027          -7.0445946
  s             55.64695           22.945254            44.95712           3.6745763
     pred_minus_obs_H_b5 pred_minus_obs_H_b6 pred_minus_obs_H_b7 pred_minus_obs_H_b8
  d            -37.75296           -42.72981        -21.66481481            3.772407
  h            -25.02354           -35.37042        -21.25583333            4.638958
  o            -52.34162           -57.74297        -15.35891892          -14.684324
  s            -24.41831           -31.63898          0.06610169            5.355085
    pred_minus_obs_H_b9 pred_minus_obs_S_b1 pred_minus_obs_S_b2 pred_minus_obs_S_b3
  d           -0.6970370           -20.12778          -1.1394444           -4.508519
  h           -4.5602083           -19.90458          -0.8512500           -4.302500
  o          -22.5089189           -19.82000          -0.9532432           -4.145946
  s           -0.3098305           -20.19966          -1.0466102           -4.390508
    pred_minus_obs_S_b4 pred_minus_obs_S_b5 pred_minus_obs_S_b6 pred_minus_obs_S_b7
  d            -21.34370          -0.9635185           -4.722407           -19.31796
  h            -21.09187          -1.0172917           -4.687083           -18.68271
  o            -20.55324          -0.8581081           -4.273514           -18.23081
  s            -20.88051          -1.0201695           -4.613898           -18.91254
    pred_minus_obs_S_b8 pred_minus_obs_S_b9
  d            -1.860926           -4.510000
  h            -1.345000           -3.968542
  o            -1.442432           -4.048108
  s            -1.569492           -4.051695

 Coefficients of linear discriminants:
                        LD1          LD2         LD3
 b1                   0.15410941  0.083191904 -0.05574989
 b2                  -0.09033693 -0.098788690 -0.05294103
 b3                   0.01730795  0.034581992  0.17366457
 b4                   0.03142488  0.086065613  0.15504752
 b5                   1.01223928  0.520003333  0.41695043
 b6                  -0.40178858 -0.526621626 -0.48149395
 b7                  -0.23331487 -0.185079579 -0.05010620
 b8                  -0.32959040 -0.218615144 -0.10788013
 b9                   0.13938043  0.179366235  0.15483732
 pred_minus_obs_H_b1 -0.02732135  0.013629388 -0.02773200
 pred_minus_obs_H_b2  0.16743148 -0.086326071 -0.27561332
 pred_minus_obs_H_b3 -0.11530638  0.044265889  0.35503689
 pred_minus_obs_H_b4  0.05370740  0.068117979  0.13440809
 pred_minus_obs_H_b5  1.03718236  0.462674531  0.38562055
 pred_minus_obs_H_b6 -0.40022794 -0.541066832 -0.32075713
 pred_minus_obs_H_b7 -0.19480130 -0.138378132  0.06374322
 pred_minus_obs_H_b8 -0.13609236 -0.003440928  0.09291261
 pred_minus_obs_H_b9  0.01171503 -0.160364856 -0.11694535
 pred_minus_obs_S_b1 -0.05050788  0.019637786 -0.03832580
 pred_minus_obs_S_b2  0.05946038  0.023019484 -0.06508984
 pred_minus_obs_S_b3 -0.05777119 -0.138126136  0.07659433
 pred_minus_obs_S_b4  0.03461031  0.008094415  0.06487418
 pred_minus_obs_S_b5  0.63346845  0.105556436 -0.01382360
 pred_minus_obs_S_b6 -0.30309468 -0.109369091  0.07915327
 pred_minus_obs_S_b7 -0.07614580 -0.053089078  0.01138394
 pred_minus_obs_S_b8 -0.13416272  0.328494630 -0.44248108
 pred_minus_obs_S_b9  0.06414547 -0.347941228  0.43430498

Proportion of trace:
 LD1    LD2    LD3 
 0.7365 0.1971 0.0664

From Wikipedia and other sources, my understanding is that LD1, LD2 and LD3 are linear functions that I can use to classify new data (LD1 accounts for 73.7% of the trace and LD2 for 19.7%). What I cannot work out is how to use this result to reduce the number of features or to select only the relevant ones, since LD1 and LD2 each have a coefficient for every feature.

I am looking for help on interpreting the results to reduce the number of features from $27$ to some $x<27$.

Best Answer

If it doesn't need to be vanilla LDA (which is not designed to select among the input features), there is, for example, Sparse Discriminant Analysis, a LASSO-penalized LDA: Line Clemmensen, Trevor Hastie, Daniela Witten, Bjarne Ersbøll: Sparse Discriminant Analysis (2011)

The LASSO penalty shrinks some coefficients to exactly zero, so the resulting discriminant functions use only a subset of the input features.
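As a rough sketch of what this could look like in R: the code below assumes the CRAN package sparseLDA (an implementation by the paper's authors) is installed and uses its sda() function. The iris data stands in for the forest data, and the choice stop = -2 (at most two non-zero coefficients per discriminant direction) is purely illustrative, not a recommendation.

```r
# Sketch only: assumes the CRAN package 'sparseLDA' is installed.
# iris is a stand-in for the forest data set.
x <- scale(as.matrix(iris[, 1:4]))   # sda works on normalized predictors
y <- iris$Species

selected <- NULL
if (requireNamespace("sparseLDA", quietly = TRUE)) {
  # A negative 'stop' is read as the number of variables to keep
  # per discriminant direction (here: 2, an illustrative value).
  fit <- sparseLDA::sda(x, y, lambda = 1e-6, stop = -2, maxIte = 25)
  selected <- fit$varNames           # features with a non-zero coefficient
  print(selected)
}
```

In practice the number of retained variables would be tuned, for example by cross-validated classification error.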

Related Solutions

Solved – Can the scaling values in a linear discriminant analysis (LDA) be used to plot explanatory variables on the linear discriminants

Principal components analysis and Linear discriminant analysis outputs; iris data.

I will not be drawing biplots, because biplots can be drawn with various normalizations and may therefore look different. Since I'm not an R user, I find it difficult to track down how you produced your plots in order to repeat them. Instead, I will do PCA and LDA and show the results, in a manner similar to this (which you might want to read). Both analyses were done in SPSS.

Principal components of iris data:

The analysis will be based on covariances (not correlations) between the 4 variables.

Eigenvalues (component variances) and the proportion of overall variance explained
PC1   4.228241706    .924618723 
PC2    .242670748    .053066483 
PC3    .078209500    .017102610 
PC4    .023835093    .005212184 
# @Etienne's comment: 
# Eigenvalues are obtained in R by
# (prcomp(iris[,-5])$sdev)^2; (princomp(iris[,-5])$sdev)^2 gives slightly
# smaller values (factor 149/150) because princomp divides by n, not n-1.
# Proportion of variance explained is obtained in R by
# summary(princomp(iris[,-5])) or summary(prcomp(iris[,-5]))

Eigenvectors (cosines of rotation of variables into components)
              PC1           PC2           PC3           PC4
SLength   .3613865918   .6565887713  -.5820298513   .3154871929 
SWidth   -.0845225141   .7301614348   .5979108301  -.3197231037 
PLength   .8566706060  -.1733726628   .0762360758  -.4798389870 
PWidth    .3582891972  -.0754810199   .5458314320   .7536574253    
# @Etienne's comment: 
# This is obtained in R by
# prcomp(iris[,-5])$rotation or princomp(iris[,-5])$loadings

Loadings (eigenvectors scaled by the square roots of the respective eigenvalues;
loadings are the covariances between variables and standardized components)
              PC1           PC2           PC3           PC4
SLength    .743108002    .323446284   -.162770244    .048706863 
SWidth    -.173801015    .359689372    .167211512   -.049360829 
PLength   1.761545107   -.085406187    .021320152   -.074080509 
PWidth     .736738926   -.037183175    .152647008    .116354292    
# @Etienne's comment: 
# Loadings can be obtained in R with
# t(t(prcomp(iris[,-5])$rotation) * prcomp(iris[,-5])$sdev) or, up to a
# factor sqrt(149/150) caused by princomp's divisor n, with
# t(t(princomp(iris[,-5])$loadings) * princomp(iris[,-5])$sdev)

Standardized (rescaled) loadings
(loadings divided by st. deviations of the respective variables)
              PC1           PC2           PC3           PC4
SLength    .897401762     .390604412   -.196566721    .058820016
SWidth    -.398748472     .825228709    .383630296   -.113247642
PLength    .997873942    -.048380599    .012077365   -.041964868
PWidth     .966547516   -.048781602    .200261695    .152648309  
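The rescaled loadings have no R counterpart in the comments above, so here is a minimal sketch of one way to get them with prcomp(). It also checks that they coincide with the plain correlations between variables and components. The signs of whole columns may differ from the SPSS output, since the sign of an eigenvector is arbitrary.

```r
X  <- iris[, -5]
pc <- prcomp(X)                          # covariance-based PCA (divisor n-1)

# Loadings: eigenvectors scaled by the square roots of the eigenvalues
loadings <- t(t(pc$rotation) * pc$sdev)

# Rescaled loadings: each row divided by that variable's standard deviation
rescaled <- loadings / apply(X, 2, sd)
print(round(rescaled, 4))

# They equal the correlations between variables and component scores
print(all.equal(rescaled, cor(X, pc$x)))
```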

Raw component scores (Centered 4-variable data multiplied by eigenvectors)
     PC1           PC2           PC3           PC4
-2.684125626    .319397247   -.027914828    .002262437 
-2.714141687   -.177001225   -.210464272    .099026550 
-2.888990569   -.144949426    .017900256    .019968390 
-2.745342856   -.318298979    .031559374   -.075575817 
-2.728716537    .326754513    .090079241   -.061258593 
-2.280859633    .741330449    .168677658   -.024200858 
-2.820537751   -.089461385    .257892158   -.048143106 
-2.626144973    .163384960   -.021879318   -.045297871 
-2.886382732   -.578311754    .020759570   -.026744736 
-2.672755798   -.113774246   -.197632725   -.056295401 
... etc.
# @Etienne's comment: 
# This is obtained in R with
# prcomp(iris[,-5])$x or princomp(iris[,-5])$scores.
# Can also be eigenvector normalized for plotting

Standardized (to unit variances) component scores, when multiplied
by loadings return original centered variables.
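A short check of that statement in R, as a sketch using prcomp(): the standardized scores are the raw scores divided by the component standard deviations, and multiplying them by the transposed loadings should give back the centered data.

```r
pc <- prcomp(iris[, -5])

loadings   <- t(t(pc$rotation) * pc$sdev)   # eigenvectors * sqrt(eigenvalues)
std_scores <- t(t(pc$x) / pc$sdev)          # scores rescaled to unit variance

centered <- scale(iris[, -5], center = TRUE, scale = FALSE)
recon    <- std_scores %*% t(loadings)      # should reproduce 'centered'

print(apply(std_scores, 2, var))            # all equal to 1
print(all.equal(recon, centered, check.attributes = FALSE))
```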

It is important to stress that it is loadings, not eigenvectors, by which we typically interpret principal components (or factors in factor analysis), if we need to interpret them at all. Loadings are the regression coefficients for modeling the variables by the standardized components. At the same time, because components are uncorrelated with each other, loadings are also the covariances between the components and the variables. Standardized (rescaled) loadings, like correlations, cannot exceed 1 in absolute value, and they are handier to interpret because the effect of the variables' unequal variances is removed.

It is loadings, not eigenvectors, that are typically displayed on a biplot side-by-side with component scores; the latter are often displayed column-normalized.

Linear discriminants of iris data:

There are 3 classes and 4 variables, so min(3-1,4)=2 discriminants can be extracted.
Only the extraction (no classification of data points) will be done.

The Within scatter matrix 
38.95620000   13.63000000   24.62460000    5.64500000 
13.63000000   16.96200000    8.12080000    4.80840000 
24.62460000    8.12080000   27.22260000    6.27180000 
 5.64500000    4.80840000    6.27180000    6.15660000 

The Between scatter matrix 
 63.2121333   -19.9526667   165.2484000    71.2793333 
-19.9526667    11.3449333   -57.2396000   -22.9326667 
165.2484000   -57.2396000   437.1028000   186.7740000 
 71.2793333   -22.9326667   186.7740000    80.4133333
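For R users, the two scatter matrices above can be reproduced with a few lines of base R. This is a sketch of the textbook definitions (sums of squares and cross-products within groups and between group means), not the SPSS routine itself.

```r
X <- as.matrix(iris[, 1:4])
g <- iris$Species
grand <- colMeans(X)

W <- matrix(0, ncol(X), ncol(X), dimnames = list(colnames(X), colnames(X)))
B <- W
for (k in levels(g)) {
  Xk <- X[g == k, , drop = FALSE]
  mk <- colMeans(Xk)
  W <- W + crossprod(sweep(Xk, 2, mk))       # scatter around the group mean
  d <- mk - grand
  B <- B + nrow(Xk) * outer(d, d)            # weighted scatter of group means
}

print(round(W, 4))
print(round(B, 4))
# Within + Between equals the total scatter around the grand mean
print(all.equal(W + B, crossprod(sweep(X, 2, grand))))
```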

Eigenvalues and canonical correlations
(Canonical correlation squared is SSbetween/SStotal of ANOVA by that discriminant)
Dis1    32.19192920     .98482089 
Dis2      .28539104     .47119702
# @Etienne's comment:
# In R eigenvalues are expected from
# lda(as.factor(Species)~.,data=iris)$svd, but this produces
#   Dis1       Dis2
# 48.642644  4.579983
# @ttnphns' comment:
# The difference is one of scaling. $svd holds the ratio of the between-
# to within-group standard deviations on each discriminant, so
# svd^2 = eigenvalue * (N-K)/(K-1); here 48.642644^2 = 32.19192920 * 147/2.
# Canonical correlations are the same either way.
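To make the link between the two sets of numbers concrete, here is a sketch that computes the eigenvalues of W^-1 B directly, derives the canonical correlations as sqrt(eigenvalue / (1 + eigenvalue)), and compares the eigenvalues with the squared $svd values from MASS::lda(). The variable names are my own.

```r
library(MASS)

X <- as.matrix(iris[, 1:4])
g <- iris$Species
N <- nrow(X); K <- nlevels(g)

# Within scatter from group-centered data; Between = Total - Within
W   <- crossprod(X - apply(X, 2, function(v) ave(v, g)))
Tot <- crossprod(scale(X, center = TRUE, scale = FALSE))
B   <- Tot - W

# Only the first K-1 = 2 eigenvalues are non-zero
ev    <- Re(eigen(solve(W) %*% B)$values)[1:2]
canon <- sqrt(ev / (1 + ev))
print(ev)
print(canon)

fit <- lda(Species ~ ., data = iris)
# svd^2 = eigenvalue * (N-K)/(K-1)
print(all.equal(fit$svd^2, ev * (N - K) / (K - 1)))
```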

Eigenvectors
              Dis1          Dis2
SLength  -.0684059150   .0019879117 
SWidth   -.1265612055   .1785267025 
PLength   .1815528774  -.0768635659 
PWidth    .2318028594   .2341722673

Eigenvectors (as before, but column-normalized to SS=1: cosines of rotation of variables into discriminants).
              Dis1          Dis2
SLength  -.2087418215   .0065319640 
SWidth   -.3862036868   .5866105531 
PLength   .5540117156  -.2525615400 
PWidth    .7073503964   .7694530921

Unstandardized discriminant coefficients (proportionally related to eigenvectors)
              Dis1          Dis2
SLength   -.829377642    .024102149 
SWidth   -1.534473068   2.164521235 
PLength   2.201211656   -.931921210 
PWidth    2.810460309   2.839187853
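# @Etienne's comment:
# This is obtained in R with
# lda(as.factor(Species)~.,data=iris)$scaling (possibly with flipped signs).
# The MASS documentation describes $scaling as normalized so that the
# within-groups covariance matrix is spherical; these are the unstandardized
# coefficients, not the standardized ones shown next.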

Standardized discriminant coefficients
              Dis1          Dis2
SLength  -.4269548486   .0124075316 
SWidth   -.5212416758   .7352613085 
PLength   .9472572487  -.4010378190 
PWidth    .5751607719   .5810398645
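MASS::lda() does not return the standardized coefficients directly. One way to get them, sketched below, is to multiply each unstandardized coefficient by the pooled within-group standard deviation of its variable. Column signs may be flipped relative to the SPSS output.

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])
g <- iris$Species

# Pooled within-group covariance matrix (divisor N - K)
resid <- X - fit$means[as.character(g), ]
Sw <- crossprod(resid) / (nrow(X) - nlevels(g))

# Standardized coefficient = raw coefficient * pooled within-group sd
std_coef <- fit$scaling * sqrt(diag(Sw))
print(round(std_coef, 4))
```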

Pooled within-groups correlations between variables and discriminants
              Dis1          Dis2
SLength   .2225959415   .3108117231 
SWidth   -.1190115149   .8636809224 
PLength   .7060653811   .1677013843 
PWidth    .6331779262   .7372420588 
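These pooled within-group correlations (the "structure matrix" in SPSS terms) can be reproduced in R by removing the group means from both the variables and the discriminant scores and then correlating the residuals. A minimal sketch, with names of my own choosing:

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])
g <- iris$Species
scores <- predict(fit)$x

# Remove group means from the variables and from the discriminant scores
Xw <- X - fit$means[as.character(g), ]
Dw <- scores - apply(scores, 2, function(s) ave(s, g))

# Correlating the residuals gives the pooled within-group correlations
structure_mat <- cor(Xw, Dw)
print(round(structure_mat, 4))
```

Column signs can differ from the SPSS output because the direction of each discriminant is arbitrary.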

Discriminant scores (Centered 4-variable data multiplied by unstandardized coefficients)
     Dis1           Dis2
-8.061799783    .300420621 
-7.128687721   -.786660426 
-7.489827971   -.265384488 
-6.813200569   -.670631068 
-8.132309326    .514462530 
-7.701946744   1.461720967 
-7.212617624    .355836209 
-7.605293546   -.011633838 
-6.560551593  -1.015163624 
-7.343059893   -.947319209
... etc.
# @Etienne's comment:
# This is obtained in R with
# predict(lda(as.factor(Species)~.,data=iris), iris[,-5])$x
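The statement "centered data multiplied by unstandardized coefficients" can be checked directly in R. The sketch below relies on the iris groups being of equal size, so that the prior-weighted mean that predict() uses for centering coincides with the grand mean.

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])

# Center at the grand mean, then apply the unstandardized coefficients
manual <- scale(X, center = TRUE, scale = FALSE) %*% fit$scaling
auto   <- predict(fit)$x

print(head(manual, 3))
print(all.equal(manual, auto, check.attributes = FALSE))
```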

For the computations involved in extracting discriminants in LDA, please look here. We usually interpret discriminants by the discriminant coefficients or the standardized discriminant coefficients (the latter are handier because the effect of differing variances among the variables is removed). This is like in PCA. But note: the coefficients here are the regression coefficients for modeling the discriminants by the variables, not vice versa as in PCA. Because the variables are not uncorrelated, the coefficients cannot be seen as covariances between variables and discriminants.

Yet we have another matrix that may serve as an alternative source for interpreting the discriminants: the pooled within-group correlations between the discriminants and the variables. Because discriminants are uncorrelated, like PCs, this matrix is in a sense analogous to the standardized loadings of PCA.

In all, while in PCA we have only one matrix, the loadings, to help interpret the latent variables, in LDA we have two alternative matrices for that. If you need to plot (a biplot or whatever), you have to decide whether to plot coefficients or correlations.

And, of course, it goes without saying that in PCA of the iris data the components don't "know" that there are 3 classes, so they can't be expected to discriminate between classes. Discriminants do "know" that there are classes, and discriminating between them is their natural job.

Dimensionality Reduction – Using LDA for Dimensionality Reduction

LDA used as a dimensionality-reducing technique can be seen as a "supervised PCA": it maps your data into a new space (of lower dimension) where the classes should be better separated, based on the labels you provided.

The projection matrix is made of the leading eigenvectors (those with positive eigenvalues) given by LDA. Use it to project your test data into the new feature space, then feed the projected vectors into your SVM.

Note that you should use a non-linear kernel in your SVM (e.g. RBF); otherwise you stack one linear transformation on top of another, and the result is still a linear classifier, which is unlikely to improve discrimination. A linear SVM and LDA both produce linear decision boundaries, although they fit them by different criteria.
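A minimal sketch of that pipeline in R, under a few assumptions of mine: iris stands in for the real data, the 100/50 train/test split is arbitrary, and the SVM step assumes the e1071 package is installed (it is skipped otherwise).

```r
library(MASS)

set.seed(1)
idx   <- sample(nrow(iris), 100)      # arbitrary 100/50 split
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit LDA on the training data only, then project both sets
fit     <- lda(Species ~ ., data = train)
z_train <- predict(fit, train)$x      # 100 x 2 matrix of discriminant scores
z_test  <- predict(fit, test)$x       # 50 x 2

# Optional: RBF-kernel SVM on the projected features (assumes 'e1071')
if (requireNamespace("e1071", quietly = TRUE)) {
  model <- e1071::svm(x = z_train, y = train$Species, kernel = "radial")
  acc   <- mean(predict(model, z_test) == test$Species)
  print(acc)
}
```

Fitting the projection on the training data alone keeps the test labels out of the dimensionality reduction step.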
