Solved – Find variables selected for each subset using caret feature selection

caret, feature selection

I am doing feature selection using the 'rfe' function in the caret package (http://caret.r-forge.r-project.org/featureselection.html). This function uses a performance metric to find the optimal number of variables and which variables those are. However, I would also like to see the intermediate steps of the feature selection, not only the final one. For instance, I would like to know which variables were selected when exactly 10 variables were kept.

My code is the following:

ctrl <- rfeControl(functions = rfFuncs,
                   method = "cv",
                   verbose = FALSE)
subsets <- c(5,10,15,20,25)
lmProfile <- rfe(dat2_X, dat2_Y,
                 sizes = subsets,
                 rfeControl = ctrl)

Best Answer

See lmProfile$variables. It has the ranking metrics for each predictor at each iteration. For example, from ?rfe:

data(BloodBrain)

x <- scale(bbbDescr[,-nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)

set.seed(1)
lmProfile <- rfe(x, logBBB,
                 sizes = 10:20,
                 rfeControl = rfeControl(functions = lmFuncs, 
                                         number = 15))

head(lmProfile$variables) gives:

 Overall           var Variables   Resample
4.930084     vsa_other        71 Resample01
4.696723    slogp_vsa5        71 Resample01
3.877510         pnsa1        71 Resample01
3.649555      vsa_base        71 Resample01
3.586327 frac.cation7.        71 Resample01
3.301325        a_base        71 Resample01

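Each resample contributes one block of rows per subset size: 71 rows for the variables selected at a subset size of 71 (the full predictor set in this example), 20 rows for those selected at a size of 20, and so on. To see the variables chosen at a particular size, filter on the Variables column.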
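As a rough sketch of how the rows for one subset size might be pulled out and summarised, the snippet below uses a small hand-made data frame with the same four columns as lmProfile$variables. The values, the predictor names a, b and c, and the helper vars_at_size are all hypothetical and only there for illustration; it uses base R only. Because each resample can select a different set, the helper counts how often each predictor was kept at the requested size.

```r
# Toy stand-in for lmProfile$variables (made-up values: two resamples,
# subset sizes 3 and 2). A real rfe() result has the same four columns.
vars <- data.frame(
  Overall   = c(4.9, 4.7, 3.9, 4.8, 4.1,
                5.1, 4.2, 3.5, 5.0, 4.4),
  var       = c("a", "b", "c", "a", "b",
                "a", "c", "b", "a", "c"),
  Variables = c(3, 3, 3, 2, 2,
                3, 3, 3, 2, 2),
  Resample  = rep(c("Resample01", "Resample02"), each = 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise which predictors were kept at one subset size.
vars_at_size <- function(variables, size) {
  # Keep only the rows recorded for the requested subset size
  rows <- variables[variables$Variables == size, ]
  # Number of resamples in which each predictor was selected at this size
  freq <- table(rows$var)
  # Mean importance over the resamples that selected the predictor
  imp <- tapply(rows$Overall, rows$var, mean)
  out <- data.frame(var          = names(freq),
                    n_resamples  = as.integer(freq),
                    mean_overall = as.numeric(imp[names(freq)]),
                    stringsAsFactors = FALSE)
  # Most frequently selected first, ties broken by mean importance
  out <- out[order(-out$n_resamples, -out$mean_overall), ]
  rownames(out) <- NULL
  out
}

vars_at_size(vars, 2)
```

With a real fit, the equivalent call would be vars_at_size(lmProfile$variables, 10). Ranking by selection frequency is just one way to summarise the resamples; a predictor that appears in only a few of them at size 10 was not chosen consistently.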

Max