Solved – Identical variable importance values for different model types

Tags: caret, categorical data, classification, feature selection, r

I trained two different caret models on the same multi-class training data using repeated cross-validation and computed the variable importance. What strikes me is that varImp returns exactly the same variable importance values for both model types.

This is what I see; how I arrived at these results is shown further below:

Variable Importance for k-Nearest Neighbors:

> varImp(models$knn)

ROC curve variable importance

  variables are sorted by maximum importance across the classes
     Class_1 Class_2 Class_3 Class_4 Class_5
V5    59.24  100.00   94.89   89.17  100.00
V9    98.50   96.92   98.50   68.75   87.50
V3    13.47   36.09   96.48   96.48   79.93
V4    96.39   93.22   96.39   35.92    0.00
V8    55.19   55.19   74.91   94.98   94.98
V10   68.40   81.16   83.63   83.63   83.01
V6    15.23   44.89   71.21   78.79   78.79
V7    72.89   70.07   73.77   58.27   73.77
V2    31.51   72.62   72.62   54.14   25.09
V11   51.41   67.87   67.87   46.92   50.79
U2    38.47   40.45   38.12   40.45   34.42
U3    33.71   33.63   39.70   39.13   39.70
U5    33.23   38.64   38.64   36.58   31.95
U1    31.87   34.86   33.98   35.56   35.56
U4    30.28   34.37   33.63   34.37   33.89

Variable Importance for Penalized Discriminant Analysis:

> varImp(models$pda)
ROC curve variable importance

  variables are sorted by maximum importance across the classes
    Class_1 Class_2 Class_3 Class_4 Class_5
V5    59.24  100.00   94.89   89.17  100.00
V9    98.50   96.92   98.50   68.75   87.50
V3    13.47   36.09   96.48   96.48   79.93
V4    96.39   93.22   96.39   35.92    0.00
V8    55.19   55.19   74.91   94.98   94.98
V10   68.40   81.16   83.63   83.63   83.01
V6    15.23   44.89   71.21   78.79   78.79
V7    72.89   70.07   73.77   58.27   73.77
V2    31.51   72.62   72.62   54.14   25.09
V11   51.41   67.87   67.87   46.92   50.79
U2    38.47   40.45   38.12   40.45   34.42
U3    33.71   33.63   39.70   39.13   39.70
U5    33.23   38.64   38.64   36.58   31.95
U1    31.87   34.86   33.98   35.56   35.56
U4    30.28   34.37   33.63   34.37   33.89

Parameter Definition and Model Training

Here is the way I trained my models:

library(caret)

inputMatrix <- read.csv("data/inputMatrix.csv", stringsAsFactors = TRUE)
cvMethod  <- "repeatedcv" # cross validation method
folds     <- 5 # number of folds
repeats   <- 10 # number of complete sets of folds
trainPerc <- .80 # percentage of training split
metric    <- "Accuracy"
summaryFunction <- defaultSummary

## Parameters for cross-validation and result summary
ctrl <- trainControl(
  classProbs = TRUE,
  method = cvMethod, # cross-validation method
  number = folds,    # number of folds
  repeats = repeats, # number of complete sets of folds
  summaryFunction = summaryFunction,
  savePredictions = TRUE,
  allowParallel = TRUE)

## Defining tuning parameters for each model
tuneGrids <- list(knn=expand.grid(k=runif(10,10,40)),
                  pda=expand.grid(lambda=runif(10,0,20)))

## Splitting input matrix into feature matrix and label vector
X <- inputMatrix[, 2:ncol(inputMatrix)]
Y <- as.factor(inputMatrix[, 1])

## Defining training-/testset split
partition <- createDataPartition(Y, p=trainPerc, list=FALSE)

trainData    <- X[partition, ] 
trainClasses <- Y[partition]
testData     <- X[-partition, ]
testClasses  <- Y[-partition]

## Train models
models <- list(knn=train(trainData, trainClasses, 
                         method = "knn", 
                         tuneGrid = tuneGrids$knn, 
                         trControl = ctrl,
                         metric = metric),
               pda=train(trainData, trainClasses, 
                         method = "pda", 
                         tuneGrid = tuneGrids$pda,
                         trControl = ctrl,
                         metric = metric))

Training Results

When I look at the model summaries, I can see that they are in fact the two models I trained:

> models$knn

k-Nearest Neighbors 

200 samples
 15 predictors
  5 classes: 'Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 10 times) 
Summary of sample sizes: 160, 160, 160, 160, 160, 160, ... 
Resampling results across tuning parameters:

  k         Accuracy  Kappa   
  11.98987  0.8665    0.833125
  18.88380  0.8820    0.852500
  20.05730  0.8855    0.856875
  28.94440  0.8880    0.860000
  29.71238  0.8915    0.864375
  30.49805  0.8905    0.863125
  32.22957  0.8910    0.863750
  36.00965  0.8935    0.866875
  38.68261  0.8965    0.870625
  39.66626  0.8955    0.869375

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was k = 38.68261. 

> models$pda

Penalized Discriminant Analysis 

200 samples
 15 predictors
  5 classes: 'Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 10 times) 
Summary of sample sizes: 160, 160, 160, 160, 160, 160, ... 
Resampling results across tuning parameters:

  lambda     Accuracy  Kappa   
   6.220596  0.9140    0.892500
   6.904289  0.9135    0.891875
   7.538826  0.9135    0.891875
   7.787587  0.9135    0.891875
  12.298270  0.9135    0.891875
  13.870132  0.9140    0.892500
  15.338699  0.9145    0.893125
  16.059969  0.9145    0.893125
  16.104978  0.9145    0.893125
  17.066850  0.9145    0.893125

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was lambda = 15.3387. 

Session Info for Reproducibility

> sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mda_0.4-8       class_7.3-14    caret_6.0-70    ggplot2_2.1.0   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5        magrittr_1.5       splines_3.2.3      MASS_7.3-45        munsell_0.4.2      colorspace_1.2-6  
 [7] foreach_1.4.3      minqa_1.2.4        stringr_1.0.0      car_2.1-1          plyr_1.8.3         tools_3.2.3       
[13] nnet_7.3-11        pbkrtest_0.4-4     parallel_3.2.3     grid_3.2.3         gtable_0.1.2       nlme_3.1-122      
[19] mgcv_1.8-10        quantreg_5.19      e1071_1.6-7        MatrixModels_0.4-1 iterators_1.0.8    lme4_1.1-10       
[25] Matrix_1.2-3       nloptr_1.0.4       reshape2_1.4.1     codetools_0.2-14   stringi_1.0-1      compiler_3.2.3    
[31] scales_0.3.0       stats4_3.2.3       SparseM_1.7       

I can't wrap my head around why this happens. Is this to be expected in some way, or am I missing something? This behavior is not limited to knn and pda; it occurs for every variable importance calculation that is not based on a model-specific metric.

Thanks for your help!

Best Answer

Alright, looking into the code of varImp.train I see that, for a classification problem where the model has no model-specific importance measure, the variable importance is simply computed via the filterVarImp function. The variables are therefore ranked by a ROC/AUC statistic, as stated in the varImp documentation under model-independent metrics.
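
One way to verify this yourself (a quick sketch, assuming the method is still registered under this name in your caret version) is to print the unexported train method of varImp and look for the branch that calls filterVarImp:

## Print the (unexported) varImp method for train objects; for
## classification models without a model-specific importance it
## falls through to filterVarImp().
library(caret)
caret:::varImp.train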

I tested this by calling varImp on each of my models and comparing the variable importance values with those computed via filterVarImp on the training data.

## Compute variable importance via filter approach
varImps.filtered <- filterVarImp(trainData, trainClasses)
varImps <- list(knn=varImp(models$knn, scale=FALSE),
                pda=varImp(models$pda, scale=FALSE))

## Sort variable importance by their average value 
## over all classes in decreasing order.
varImps.filtered$Mean <- apply(varImps.filtered, 1, mean)
varImps.filtered <- varImps.filtered[with(varImps.filtered, order(-Mean)), ]
varImps.filtered$Mean <- NULL

... and surprise surprise, it is exactly the same.

> varImps$knn

ROC curve variable importance

  variables are sorted by maximum importance across the classes
    Class_1 Class_2 Class_3 Class_4 Class_5
V5   0.7094  0.9912  0.9431  0.9231  0.9912
V3   0.3706  0.5631  0.9744  0.9744  0.7831
V9   0.9725  0.9619  0.9725  0.8125  0.8988
V8   0.6887  0.6644  0.8650  0.9531  0.9531
V4   0.9325  0.9194  0.9325  0.6044  0.3138
V10  0.7250  0.8119  0.8544  0.8544  0.8331
V7   0.8169  0.7606  0.8244  0.7025  0.8244
V6   0.3650  0.5775  0.7838  0.8081  0.8081
V11  0.6194  0.7662  0.7662  0.6000  0.6506
V2   0.5138  0.7412  0.7412  0.5938  0.4031
U5   0.5609  0.5731  0.5731  0.4944  0.4834
U4   0.5259  0.5531  0.5531  0.5103  0.5109
U3   0.5134  0.5134  0.5103  0.5384  0.5384
U2   0.5384  0.5203  0.5216  0.5384  0.5219
U1   0.4853  0.5312  0.5312  0.5238  0.4872

> varImps.filtered
      Class_1   Class_2   Class_3   Class_4   Class_5
V9  0.9725000 0.9618750 0.9725000 0.8125000 0.8987500
V5  0.7093750 0.9912500 0.9431250 0.9231250 0.9912500
V8  0.6887500 0.6643750 0.8650000 0.9531250 0.9531250
V10 0.7250000 0.8118750 0.8543750 0.8543750 0.8331250
V7  0.8168750 0.7606250 0.8243750 0.7025000 0.8243750
V4  0.9325000 0.9193750 0.9325000 0.6043750 0.3137500
V3  0.3706250 0.5631250 0.9743750 0.9743750 0.7831250
V11 0.6193750 0.7662500 0.7662500 0.6000000 0.6506250
V6  0.3650000 0.5775000 0.7837500 0.8081250 0.8081250
V2  0.5137500 0.7412500 0.7412500 0.5937500 0.4031250
U5  0.5609375 0.5731250 0.5731250 0.4943750 0.4834375
U4  0.5259375 0.5531250 0.5531250 0.5103125 0.5109375
U2  0.5384375 0.5203125 0.5215625 0.5384375 0.5218750
U3  0.5134375 0.5134375 0.5103125 0.5384375 0.5384375
U1  0.4853125 0.5312500 0.5312500 0.5237500 0.4871875
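
For completeness, the match can also be checked programmatically instead of by eye. A small sketch, reusing the objects created above:

## Reorder the filter-based table to the row order of varImp()'s
## output and compare the numbers directly.
knnImp <- varImps$knn$importance
all.equal(varImps.filtered[rownames(knnImp), ], knnImp,
          check.attributes = FALSE)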

My goal was to come up with a model-specific, stable feature selection method. The only way I see to achieve this now is by using caret's built-in feature selection methods, such as recursive feature elimination (RFE) and selection by filter (SBF). As far as I understand it, however, RFE only ships with predefined helper functions for a handful of models in caret, so I might have to implement the rfeControl functions myself.
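
For reference, a minimal RFE sketch using caret's generic wrapper functions, reusing trainData and trainClasses from the training code above (the subset sizes are only illustrative). Note that caretFuncs ranks predictors via varImp(), so for models without a model-specific importance this would again fall back to the filter ranking unless the rank function is replaced:

## Generic RFE wrapper: caretFuncs$fit calls train(), so any caret
## method (here "pda") can be plugged in; extra arguments to rfe()
## are forwarded to train().
rfeCtrl <- rfeControl(functions = caretFuncs,
                      method = "repeatedcv",
                      number = 5,
                      repeats = 10)

rfeFit <- rfe(trainData, trainClasses,
              sizes = c(2, 4, 6, 8, 10),
              rfeControl = rfeCtrl,
              method = "pda",
              trControl = trainControl(method = "cv", classProbs = TRUE),
              metric = "Accuracy")

rfeFit$optVariables  # predictors retained by the best subset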
