Solved – Feature Importance in each fold and repeat after repeated cross validation in caret

Tags: caret, elastic net, feature selection, r

This is my first post on Cross Validated, so I apologize in advance if I am not yet familiar with the conventions regarding forum posts.
Currently, I am working on a feature selection task using elastic net in caret, and I would like to visualize the feature importance for each model trained in each cross-validation step. However, I can't find a way to access the coefficients other than those of the final model.

Here is more or less a minimal example of what I'm doing.

library(caret)
library(doMC)

registerDoMC(16) # register 16 cores

Define the tuning grid of elastic net parameters:

grid <- expand.grid(.lambda = seq(0, 1, length=20),
                    .alpha = seq(0, 1, length = 11))

Provide a control object defining the repeated cross-validation scheme: 5-fold cross-validation, repeated 5 times.

ctrl <- trainControl(method = "repeatedcv",  # cross-validation method
                     number = 5,             # number of folds
                     repeats = 5,            # number of complete sets of folds
                     allowParallel = TRUE)   # utilize parallelization

Train the models with the defined cross-validation scheme and parameter grid. The features will also be centered and scaled.

model <- train(x = iris[,-5],
               y = iris$Species,
               method = "glmnet",
               type.gaussian = "naive",
               tuneGrid = grid,
               trControl = ctrl,
               preProc = c("center", "scale"))

Alright, now I can get some information about the test performance after repeated cross-validation.

model$resample[with(model$resample, order(Resample)), ]

Accuracy Kappa   Resample
12 1.0000000  1.00 Fold1.Rep1
19 1.0000000  1.00 Fold1.Rep2
25 1.0000000  1.00 Fold1.Rep3
2  0.9666667  0.95 Fold1.Rep4
9  0.9333333  0.90 Fold1.Rep5
1  0.9000000  0.85 Fold2.Rep1
8  0.9333333  0.90 Fold2.Rep2
15 0.9666667  0.95 Fold2.Rep3
22 0.9333333  0.90 Fold2.Rep4
16 0.9333333  0.90 Fold2.Rep5
18 1.0000000  1.00 Fold3.Rep1
11 0.9666667  0.95 Fold3.Rep2
5  1.0000000  1.00 Fold3.Rep3
3  1.0000000  1.00 Fold3.Rep4
6  1.0000000  1.00 Fold3.Rep5
23 1.0000000  1.00 Fold4.Rep1
7  1.0000000  1.00 Fold4.Rep2
10 1.0000000  1.00 Fold4.Rep3
17 0.9333333  0.90 Fold4.Rep4
24 1.0000000  1.00 Fold4.Rep5
13 0.9333333  0.90 Fold5.Rep1
20 0.9666667  0.95 Fold5.Rep2
14 0.9333333  0.90 Fold5.Rep3
21 1.0000000  1.00 Fold5.Rep4
4  1.0000000  1.00 Fold5.Rep5
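
For a quick overview, these per-fold accuracies can also be averaged within each repeat, using only the Resample labels shown above (a rough sketch):

## Sketch: mean accuracy and kappa per repeat, parsed from the Resample labels.
res <- model$resample
res$Rep <- sub("^Fold[0-9]+\\.", "", res$Resample)   # "Fold1.Rep1" -> "Rep1"
aggregate(cbind(Accuracy, Kappa) ~ Rep, data = res, FUN = mean)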

However, I don't see how to access the coefficients of the models that produced the respective CV accuracies, so that I can visualize the variable importance in the same way as for the final model.

plot(varImp(model))

[Plot: variable importance for the final model]

I would very much appreciate your help.

Best Answer

train doesn't save the model information within each fold, but you can save the models out to the file system using a custom model:

glmn_funcs <- getModelInfo("glmnet", regex = FALSE)[[1]]
glmn_funcs$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  theDots <- list(...)
  if(all(names(theDots) != "family")) theDots$family <- "multinomial"
  modelArgs <- c(list(x = as.matrix(x), y = y, alpha = param$alpha),
                 theDots)

  out <- do.call("glmnet", modelArgs)
  if(!is.na(param$lambda[1])) out$lambdaOpt <- param$lambda[1]
  ## write each fitted model to disk so it can be inspected later
  save(out, file = paste("~/tmp/glmn", param$alpha,
                         floor(runif(1, 0, 1)*100), ## to help uniqueness
                         format(Sys.time(), "%H_%M_%S.RData"),
                         sep = "_"))
  out
}

model <- train(x = iris[,-5],
               y = iris$Species,
               method = glmn_funcs,
               type.gaussian = "naive",
               tuneGrid = grid,
               trControl = ctrl,
               preProc = c("center", "scale"))

You can use the coef function on each saved model to get the slopes. Note that train did not fit all possible model/resample combinations, which would be

> length(model$control$index)*nrow(grid)
[1] 5500

(omitting the one for the final model). Instead, it fits one glmnet model per unique alpha per resample (fold x repeat):

> length(unique(grid$.alpha))*length(model$control$index)
[1] 275
> length(list.files("~/tmp", pattern = "glmn_")) ##includes the final model
[1] 276

So you will have to do some looping using something like:

> params <- coef(out, s = unique(grid$.lambda))
> names(params) ## a matrix per class
[1] "setosa"     "versicolor" "virginica" 
> lapply(params, dim)
$setosa
[1]  5 20

$versicolor
[1]  5 20

$virginica
[1]  5 20
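
Assuming the .RData files were written to ~/tmp by the fit function above (each containing a single object named out), a rough sketch of that loop could be:

## Sketch only: load every saved glmnet fit and collect its coefficients at the
## lambda values in the tuning grid. Assumes each file holds one object `out`.
files <- list.files("~/tmp", pattern = "glmn_", full.names = TRUE)
coef_list <- lapply(files, function(f) {
  load(f)                               # brings `out` into this function's scope
  coef(out, s = unique(grid$.lambda))   # one coefficient matrix per class
})
names(coef_list) <- basename(files)

Only alpha and a timestamp are encoded in the file names; tying a particular file back to a specific resample would take extra bookkeeping.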

Lastly, with recent versions of caret you don't need to prefix the parameter names with a period.
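
For example, with a current caret release the grid from the question could be written as follows (the dotted names are kept elsewhere in this post, so the code above still refers to grid$.alpha and grid$.lambda):

## Equivalent grid for recent caret versions, without the leading periods.
grid <- expand.grid(lambda = seq(0, 1, length = 20),
                    alpha  = seq(0, 1, length = 11))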

Max