Nobody ever reads the documentation :-/
The package vignette for feature selection has all the details. They can now be found at:
http://caret.r-forge.r-project.org/featureselection.html
in Algorithm #2.
In your case, you have inner resampling to tune the SVM at each iteration (line 2.9 of Algorithm #2) and an outer one to evaluate the number of predictors (line 2.1).
Why does it do this? With small to moderate numbers of instances, a simple partition to a single test set does a very poor job of estimating performance and may very well over-fit to the predictors. [1] concisely summarizes this point: "hold-out samples of tolerable size [...] do not match the cross-validation itself for reliability in assessing model fit and are hard to motivate".
I would advise reading [2], which illustrates how difficult validating feature selection can be. If you have a lot of data, perhaps a single test set would be sufficient.
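To make the nested structure concrete, here is a minimal sketch (not your code; the simulated data, subset sizes, and tuning settings are my own choices) of rfe with an outer loop over predictor subsets and an inner loop to tune the SVM:

```r
library(caret)    # rfe, rfeControl, trainControl, twoClassSim
library(kernlab)  # backend for method = "svmRadial"

set.seed(1)
dat <- twoClassSim(100)   # simulated two-class data shipped with caret

# Outer resampling (line 2.1 of Algorithm #2): evaluates each subset size
outer <- rfeControl(functions = caretFuncs, method = "cv", number = 5)

# Inner resampling (line 2.9): tunes the SVM within each outer resample
inner <- trainControl(method = "cv", number = 3)

prof <- rfe(dat[, names(dat) != "Class"], dat$Class,
            sizes = c(4, 8, 12),
            rfeControl = outer,
            method = "svmRadial", tuneLength = 2, trControl = inner)

predictors(prof)   # variables in the subset chosen by the outer loop
```

The extra arguments (`method`, `tuneLength`, `trControl`) are passed through to train() because `caretFuncs` is used, which is what produces the inner tuning loop.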
One other note: you don't show what svmFuncs is exactly, so I don't know how you are estimating variable importance. If you are using the default method, it analyzes each predictor independently, so using rerank = TRUE is a waste of time (i.e., the values will be the same at each calculation).
Max
[1] Hawkins, D. M., Basak, S. C., & Mills, D. (2003). Assessing Model Fit by Cross-Validation. Journal of Chemical Information and Modeling, 43(2), 579–586. doi:10.1021/ci025626i
[2] Ambroise, C., & McLachlan, G. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences, 99(10), 6562–6566.
I don't think caret supports multi-task learning in any of its functions. You could try the glmnet package with family set to "mgaussian". This will allow you to do feature selection via lasso regularization, ridge regularization, or elastic net regularization for a linear regression model.
There may be other R machine learning libraries with built-in feature selection that support multi-task learning. Here's some sample code for multi-task learning, using the lasso for variable selection, adapted from ?glmnet:
#Create a dataset
set.seed(42)
library(glmnet)
x=matrix(rnorm(100*20),100,20)
cf <- sample(0:1, 20, replace=TRUE) #Select some columns
response1 <- x %*% (cf*runif(20)) #Apply random coefficients
response2 <- x %*% (cf*runif(20))
y=cbind(response1, response2)
#Fit a single lasso model
#0 for ridge
#1 for lasso
#>0 & <1 for the elastic net (mix of ridge and lasso)
fit1m=glmnet(x,y,family="mgaussian",alpha=1)
plot(fit1m,type.coef="2norm")
#Select lambda through cross validation
fit1m.cv <- cv.glmnet(x,y,family="mgaussian",alpha=1)
plot(fit1m.cv)
coef(fit1m.cv) #Show coefficients at the selected value of lambda
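To see which predictors the lasso actually keeps, pull the nonzero coefficients at the cross-validated lambda. A self-contained sketch (rebuilding the same simulated data so it runs on its own):

```r
library(glmnet)
set.seed(42)
x <- matrix(rnorm(100*20), 100, 20)
cf <- sample(0:1, 20, replace = TRUE)
y <- cbind(x %*% (cf * runif(20)), x %*% (cf * runif(20)))

cvfit <- cv.glmnet(x, y, family = "mgaussian", alpha = 1)

# coef() returns one sparse matrix per response; the grouped "mgaussian"
# penalty zeroes out the same rows for both, so one response suffices
beta <- coef(cvfit, s = "lambda.min")[[1]]
keep <- rownames(beta)[as.vector(beta) != 0]
setdiff(keep, "(Intercept)")   # names of the selected predictors
```

This is the multi-task payoff: a variable is dropped or kept for both responses at once, rather than separately per response.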
Best Answer
caret has a stepLDA method available in train. This uses stepclass in the klaR package. There are also LDA feature selection tools in caret using rfe and sbf that would be helpful.
Max
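As a minimal sketch of the stepLDA route (the iris data and 5-fold CV are my own choices, not from the answer), train() drives klaR's stepclass search for you:

```r
library(caret)   # train, trainControl
library(klaR)    # stepclass, used internally by method = "stepLDA"

set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)

# stepclass adds/drops predictors to improve cross-validated accuracy,
# then an LDA model is fit on the retained set
fit <- train(Species ~ ., data = iris, method = "stepLDA", trControl = ctrl)
fit   # resampling summary for the selected model
```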