I trained two different caret models on the same multi-class training data using repeated cross-validation and computed the variable importance. What strikes me is that varImp returns exactly the same variable importance values for both model types.
This is what I see; how I arrived at these results is shown further below:
Variable Importance for k-Nearest Neighbors:
> varImp(models$knn)
ROC curve variable importance
variables are sorted by maximum importance across the classes
Class_1 Class_2 Class_3 Class_4 Class_5
V5 59.24 100.00 94.89 89.17 100.00
V9 98.50 96.92 98.50 68.75 87.50
V3 13.47 36.09 96.48 96.48 79.93
V4 96.39 93.22 96.39 35.92 0.00
V8 55.19 55.19 74.91 94.98 94.98
V10 68.40 81.16 83.63 83.63 83.01
V6 15.23 44.89 71.21 78.79 78.79
V7 72.89 70.07 73.77 58.27 73.77
V2 31.51 72.62 72.62 54.14 25.09
V11 51.41 67.87 67.87 46.92 50.79
U2 38.47 40.45 38.12 40.45 34.42
U3 33.71 33.63 39.70 39.13 39.70
U5 33.23 38.64 38.64 36.58 31.95
U1 31.87 34.86 33.98 35.56 35.56
U4 30.28 34.37 33.63 34.37 33.89
Variable Importance for Penalized Discriminant Analysis:
> varImp(models$pda)
ROC curve variable importance
variables are sorted by maximum importance across the classes
Class_1 Class_2 Class_3 Class_4 Class_5
V5 59.24 100.00 94.89 89.17 100.00
V9 98.50 96.92 98.50 68.75 87.50
V3 13.47 36.09 96.48 96.48 79.93
V4 96.39 93.22 96.39 35.92 0.00
V8 55.19 55.19 74.91 94.98 94.98
V10 68.40 81.16 83.63 83.63 83.01
V6 15.23 44.89 71.21 78.79 78.79
V7 72.89 70.07 73.77 58.27 73.77
V2 31.51 72.62 72.62 54.14 25.09
V11 51.41 67.87 67.87 46.92 50.79
U2 38.47 40.45 38.12 40.45 34.42
U3 33.71 33.63 39.70 39.13 39.70
U5 33.23 38.64 38.64 36.58 31.95
U1 31.87 34.86 33.98 35.56 35.56
U4 30.28 34.37 33.63 34.37 33.89
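That the two tables are numerically identical (and not just similar after rounding) can be verified directly. Here is a minimal sketch of such a check, assuming the models list built in the training code below:
## Compare the raw (unscaled) importance matrices of both models
viKnn <- varImp(models$knn, scale = FALSE)$importance
viPda <- varImp(models$pda, scale = FALSE)$importance
all.equal(viKnn, viPda)  # returns TRUE here, i.e. identical to the last digit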
Parameter Definition and Model Training
Here is the way I trained my models:
library(caret)

inputMatrix <- read.csv("data/inputMatrix.csv", stringsAsFactors = TRUE)

cvMethod  <- "repeatedcv"  # cross-validation method
folds     <- 5             # number of folds
repeats   <- 10            # number of complete sets of folds
trainPerc <- .80           # proportion of data used for training
metric    <- "Accuracy"
summaryFunction <- defaultSummary

## Parameters for cross-validation and result summary
ctrl <- trainControl(classProbs = TRUE,
                     method = cvMethod,
                     number = folds,
                     repeats = repeats,
                     summaryFunction = summaryFunction,
                     savePredictions = TRUE,
                     allowParallel = TRUE)

## Defining the tuning parameter grids for each model
## (randomly drawn, so even the k values for knn are non-integer)
tuneGrids <- list(knn = expand.grid(k = runif(10, 10, 40)),
                  pda = expand.grid(lambda = runif(10, 0, 20)))

## Splitting the input matrix into feature matrix and label vector
X <- inputMatrix[, 2:ncol(inputMatrix)]
Y <- as.factor(inputMatrix[, 1])

## Defining the training/test split
partition <- createDataPartition(Y, p = trainPerc, list = FALSE)
trainData    <- X[partition, ]
trainClasses <- Y[partition]
testData     <- X[-partition, ]
testClasses  <- Y[-partition]

## Train the models
models <- list(knn = train(trainData, trainClasses,
                           method = "knn",
                           tuneGrid = tuneGrids$knn,
                           trControl = ctrl,
                           metric = metric),
               pda = train(trainData, trainClasses,
                           method = "pda",
                           tuneGrid = tuneGrids$pda,
                           trControl = ctrl,
                           metric = metric))
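Note that both the tuning grids (drawn with runif) and the resampling folds are random, so reproducing the exact numbers below requires fixing the RNG state beforehand, e.g. (the seed value here is arbitrary):
set.seed(1)  # arbitrary seed; run before creating tuneGrids and calling train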
Training Results
When I look at the model summaries, I can see that these are indeed the two models I trained:
> models$knn
k-Nearest Neighbors
200 samples
15 predictors
5 classes: 'Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 10 times)
Summary of sample sizes: 160, 160, 160, 160, 160, 160, ...
Resampling results across tuning parameters:
k Accuracy Kappa
11.98987 0.8665 0.833125
18.88380 0.8820 0.852500
20.05730 0.8855 0.856875
28.94440 0.8880 0.860000
29.71238 0.8915 0.864375
30.49805 0.8905 0.863125
32.22957 0.8910 0.863750
36.00965 0.8935 0.866875
38.68261 0.8965 0.870625
39.66626 0.8955 0.869375
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 38.68261.
> models$pda
Penalized Discriminant Analysis
200 samples
15 predictors
5 classes: 'Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 10 times)
Summary of sample sizes: 160, 160, 160, 160, 160, 160, ...
Resampling results across tuning parameters:
lambda Accuracy Kappa
6.220596 0.9140 0.892500
6.904289 0.9135 0.891875
7.538826 0.9135 0.891875
7.787587 0.9135 0.891875
12.298270 0.9135 0.891875
13.870132 0.9140 0.892500
15.338699 0.9145 0.893125
16.059969 0.9145 0.893125
16.104978 0.9145 0.893125
17.066850 0.9145 0.893125
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was lambda = 15.3387.
Session Info for Reproducibility
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mda_0.4-8 class_7.3-14 caret_6.0-70 ggplot2_2.1.0 lattice_0.20-33
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 magrittr_1.5 splines_3.2.3 MASS_7.3-45 munsell_0.4.2 colorspace_1.2-6
[7] foreach_1.4.3 minqa_1.2.4 stringr_1.0.0 car_2.1-1 plyr_1.8.3 tools_3.2.3
[13] nnet_7.3-11 pbkrtest_0.4-4 parallel_3.2.3 grid_3.2.3 gtable_0.1.2 nlme_3.1-122
[19] mgcv_1.8-10 quantreg_5.19 e1071_1.6-7 MatrixModels_0.4-1 iterators_1.0.8 lme4_1.1-10
[25] Matrix_1.2-3 nloptr_1.0.4 reshape2_1.4.1 codetools_0.2-14 stringi_1.0-1 compiler_3.2.3
[31] scales_0.3.0 stats4_3.2.3 SparseM_1.7
I can't wrap my head around why this happens. Is this to be expected in some way, or am I missing something? This behavior is not limited to knn and pda; it occurs for every variable importance calculation that is not based on a model-specific metric.
Thanks for your help!
Best Answer
Alright, looking into the code of varImp.train, I see that when a model has no model-specific importance measure (as is the case for both knn and pda), the variable importance is computed via the filterVarImp function. For a classification problem, the variables are then simply ranked by the area under the ROC curve, as stated in the varImp documentation under "model independent metrics".
I tested this by calling varImp on each of my models and comparing the resulting importance values with the ones computed via filterVarImp on the training data ... and surprise, surprise: they are exactly the same.
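For reference, here is a minimal sketch of that comparison, assuming trainData and trainClasses from the question; the comment about the result reflects what I observed:
## filterVarImp computes the per-class ROC AUC for each predictor
filterVI <- filterVarImp(trainData, trainClasses)

## unscaled varImp output of one of the trained models
modelVI <- varImp(models$knn, scale = FALSE)$importance

## align the row order, then compare; returns TRUE, confirming that
## varImp.train simply delegated to filterVarImp
all.equal(modelVI[rownames(filterVI), ], filterVI)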
My goal was to come up with a stable, model-specific feature selection method. The only way I see to achieve this now is by using caret's built-in feature selection methods, such as recursive feature elimination (RFE) and selection by filter (SBF). As far as I understand it, however, RFE only ships with predefined helper functions for a handful of models, so I might be forced to implement the rfeControl$functions myself.
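Before implementing rfeControl$functions from scratch, it may be worth trying caret's generic caretFuncs helper, which fits the inner model via train() so that any caret method can be plugged into RFE. A minimal sketch under that assumption (subset sizes chosen arbitrarily):
## RFE with a generic wrapper: the inner model is fitted with train(),
## so any caret method (here pda) can be used
rfeCtrl <- rfeControl(functions = caretFuncs,
                      method = "repeatedcv",
                      number = 5,
                      repeats = 10)

rfeResult <- rfe(trainData, trainClasses,
                 sizes = c(2, 4, 6, 8, 10),  # predictor subset sizes to test
                 rfeControl = rfeCtrl,
                 ## the remaining arguments are passed on to train()
                 method = "pda",
                 metric = "Accuracy",
                 trControl = trainControl(method = "cv", classProbs = TRUE))
One caveat: caretFuncs ranks predictors by calling varImp on the inner fit, so for models without a model-specific importance measure the ranking inside RFE falls back to the same ROC filter discussed above.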