How to use LDA results for feature selection

Tags: discriminant analysis, feature selection, interpretation, r

I am working on the Forest type mapping dataset, which is available in the UCI Machine Learning Repository.

I have 27 features to predict the 4 types of forest. I am performing a Linear Discriminant Analysis (LDA) to reduce the number of features, using the lda() function from the MASS package.

The results of my LDA are as follows:

library(MASS)
forest.lda <- lda(class ~ ., data = forest)
forest.lda
Call:
lda(class ~ ., data = forest)

Prior probabilities of groups:
   d         h         o         s  
0.2727273 0.2424242 0.1868687 0.2979798 

Group means:
     b1       b2       b3        b4       b5        b6       b7       b8       b9
  d  53.31481 48.31481 68.40741  97.66667 63.18519 103.31481 98.74074  26.07407 56.46296
  h  78.27083 29.39583 55.14583 113.66667 50.22917  95.25000 98.00000 25.10417 60.06250
  o  65.08108 66.54054 87.97297 103.62162 77.70270 118.10811 91.56757 44.18919 77.94595
  s  57.96610 27.79661 51.05085  93.47458 49.67797  91.66102 76.52542 24.28814 55.67797
     pred_minus_obs_H_b1 pred_minus_obs_H_b2 pred_minus_obs_H_b3 pred_minus_obs_H_b4
  d             60.76204            2.723704            28.25963           0.7018519
  h             35.17708           21.101250            40.25167         -15.9608333
  o             48.90054          -15.450270             8.99027          -7.0445946
  s             55.64695           22.945254            44.95712           3.6745763
     pred_minus_obs_H_b5 pred_minus_obs_H_b6 pred_minus_obs_H_b7 pred_minus_obs_H_b8
  d            -37.75296           -42.72981        -21.66481481            3.772407
  h            -25.02354           -35.37042        -21.25583333            4.638958
  o            -52.34162           -57.74297        -15.35891892          -14.684324
  s            -24.41831           -31.63898          0.06610169            5.355085
    pred_minus_obs_H_b9 pred_minus_obs_S_b1 pred_minus_obs_S_b2 pred_minus_obs_S_b3
  d           -0.6970370           -20.12778          -1.1394444           -4.508519
  h           -4.5602083           -19.90458          -0.8512500           -4.302500
  o          -22.5089189           -19.82000          -0.9532432           -4.145946
  s           -0.3098305           -20.19966          -1.0466102           -4.390508
    pred_minus_obs_S_b4 pred_minus_obs_S_b5 pred_minus_obs_S_b6 pred_minus_obs_S_b7
  d            -21.34370          -0.9635185           -4.722407           -19.31796
  h            -21.09187          -1.0172917           -4.687083           -18.68271
  o            -20.55324          -0.8581081           -4.273514           -18.23081
  s            -20.88051          -1.0201695           -4.613898           -18.91254
    pred_minus_obs_S_b8 pred_minus_obs_S_b9
  d            -1.860926           -4.510000
  h            -1.345000           -3.968542
  o            -1.442432           -4.048108
  s            -1.569492           -4.051695

 Coefficients of linear discriminants:
                        LD1          LD2         LD3
 b1                   0.15410941  0.083191904 -0.05574989
 b2                  -0.09033693 -0.098788690 -0.05294103
 b3                   0.01730795  0.034581992  0.17366457
 b4                   0.03142488  0.086065613  0.15504752
 b5                   1.01223928  0.520003333  0.41695043
 b6                  -0.40178858 -0.526621626 -0.48149395
 b7                  -0.23331487 -0.185079579 -0.05010620
 b8                  -0.32959040 -0.218615144 -0.10788013
 b9                   0.13938043  0.179366235  0.15483732
 pred_minus_obs_H_b1 -0.02732135  0.013629388 -0.02773200
 pred_minus_obs_H_b2  0.16743148 -0.086326071 -0.27561332
 pred_minus_obs_H_b3 -0.11530638  0.044265889  0.35503689
 pred_minus_obs_H_b4  0.05370740  0.068117979  0.13440809
 pred_minus_obs_H_b5  1.03718236  0.462674531  0.38562055
 pred_minus_obs_H_b6 -0.40022794 -0.541066832 -0.32075713
 pred_minus_obs_H_b7 -0.19480130 -0.138378132  0.06374322
 pred_minus_obs_H_b8 -0.13609236 -0.003440928  0.09291261
 pred_minus_obs_H_b9  0.01171503 -0.160364856 -0.11694535
 pred_minus_obs_S_b1 -0.05050788  0.019637786 -0.03832580
 pred_minus_obs_S_b2  0.05946038  0.023019484 -0.06508984
 pred_minus_obs_S_b3 -0.05777119 -0.138126136  0.07659433
 pred_minus_obs_S_b4  0.03461031  0.008094415  0.06487418
 pred_minus_obs_S_b5  0.63346845  0.105556436 -0.01382360
 pred_minus_obs_S_b6 -0.30309468 -0.109369091  0.07915327
 pred_minus_obs_S_b7 -0.07614580 -0.053089078  0.01138394
 pred_minus_obs_S_b8 -0.13416272  0.328494630 -0.44248108
 pred_minus_obs_S_b9  0.06414547 -0.347941228  0.43430498

Proportion of trace:
 LD1    LD2    LD3 
 0.7365 0.1971 0.0664 

From Wikipedia and other sources, I understand that LD1, LD2 and LD3 are linear discriminant functions that I can use to classify new data (LD1 accounts for 73.7% of the trace and LD2 for 19.7%). What I cannot work out is how to use this result to reduce the number of features, or to keep only the relevant ones, since LD1 and LD2 each have a coefficient for every feature.

I am looking for help on interpreting the results to reduce the number of features from $27$ to some $x<27$.
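For reference, the quantities discussed above can be pulled out of an `lda` fit directly. The following is a minimal sketch on the built-in `iris` data, used here only as a stand-in because the `forest` data frame is not reproduced in this post; with 3 classes it yields 2 discriminants rather than 3.

```r
library(MASS)  # provides lda()

# Stand-in data: built-in iris (4 features, 3 classes), NOT the forest data
fit <- lda(Species ~ ., data = iris)

# "Proportion of trace": each discriminant's share of the squared
# singular values stored in fit$svd
prop_trace <- fit$svd^2 / sum(fit$svd^2)

# Discriminant scores (one column per LD) and predicted classes
pred <- predict(fit, iris)
scores <- pred$x                       # n x (number of discriminants)
accuracy <- mean(pred$class == iris$Species)

# Every feature has a non-zero coefficient in every discriminant,
# so the coefficient table by itself does not drop any feature
all_nonzero <- all(fit$scaling != 0)

print(round(prop_trace, 4))
print(head(scores))
```

The raw coefficients in `fit$scaling` also depend on the units of each feature, so their magnitudes are not a direct ranking of feature importance.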

Best Answer

Vanilla LDA is not designed to select among the input features: every feature receives a coefficient in every discriminant. If you are not tied to vanilla LDA, one option is Sparse Discriminant Analysis, which is a LASSO-penalized LDA: Line Clemmensen, Trevor Hastie, Daniela Witten, Bjarne Ersbøll: Sparse Discriminant Analysis (2011)

Through the LASSO (L1) regularization, it sets the coefficients of some input features exactly to zero, so the resulting discriminants use only a subset of the input features.
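A rough sketch of how this might look in R, assuming the `sparseLDA` package and its `sda()` function are available (the package choice, the argument values and the use of `iris` as stand-in data are assumptions for illustration, not part of the question). As I understand the package documentation, a negative `stop` gives the number of variables to keep per discriminant direction; check the help page before relying on it.

```r
# Stand-in data: built-in iris, standardized so the penalty treats
# all features on the same scale
x <- scale(as.matrix(iris[, 1:4]))
y <- iris$Species

# Guarded so the script still runs when sparseLDA is not installed
if (requireNamespace("sparseLDA", quietly = TRUE)) {
  # stop = -2: assumed to request 2 non-zero variables per discriminant
  fit <- sparseLDA::sda(x, y, lambda = 1e-6, stop = -2, maxIte = 25)

  # Indices of the features that received non-zero coefficients
  selected <- colnames(x)[fit$varIndex]
  print(selected)
} else {
  message("Package 'sparseLDA' is not installed; skipping the sketch.")
}
```

For the forest data, `x` would be the 27 feature columns and `y` the `class` factor; the number of retained variables is a tuning choice, typically picked by cross-validation.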
