Using caret, I want to train an SVM classifier and estimate its performance using repeated cross-validation. My dataset has a very large number of predictors (300K), and I want to reduce this number using a super simple univariate approach (like a t-test p-value below a threshold; a two-class ANOVA is fine too). If I want to customize the filter threshold to use only very significant predictors, I believe this is working for me:
require(caret)
simdata <- twoClassSim(n = 100, linearVars = 300000)
mySBF <- lmSBF
mySBF$filter <- function(score, x, y) { score <= 10e-6 }
fit <- sbf(
  form = Class ~ .,
  data = simdata,
  method = "svmLinear",
  sbfControl = sbfControl(
    functions = mySBF,
    method = 'repeatedcv',
    number = 4,
    repeats = 10
  )
)
But what if my strategy is to rank the predictors by p-value and simply take the top 100? Can anyone suggest a way to accomplish this? I don't see an obvious way to do that, since the functions of sbf appear to be applied one predictor at a time.
(I may not be using the twoClassSim function correctly; I'm just trying to provide a reproducible example.)
Thanks
Best Answer
A few things here:
- lmSBF is for linear regression. twoClassSim simulates classification data, and you wouldn't want to use a linear regression model for that.
- With method = "svmLinear" you'll need to use caretSBF or write your own fit function. You should give this page a good read since a lot of the information that you want is there.
- The filter function is given all of the scores at once, so it can return TRUE for the 10 smallest scores.

The code below probably does what you want. I didn't tune the model over the cost value but you could if needed.
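What follows is a minimal sketch of the three points above, not the code originally posted with this answer. It assumes caret and kernlab are installed. It also assumes the default caretSBF score function returns an ANOVA p-value for a factor outcome, and that filter is called once with the full named score vector. The number of simulated predictors is cut to 300 purely so the example runs quickly, and the seed, the fixed C = 1 and trainControl(method = "none") are illustrative choices.

```r
library(caret)

set.seed(1)
# Far fewer predictors than the 300K in the question, to keep the run short.
simdata <- twoClassSim(n = 100, linearVars = 300)

# Start from caretSBF (fits through train()) rather than lmSBF (linear regression).
mySBF <- caretSBF

# Assumption: the default caretSBF$score yields one p-value per predictor for a
# factor outcome, so only the filter is overridden here.
# filter() receives every score at once, so predictors can be ranked.
# NA scores are ranked last; ties are broken by column order.
mySBF$filter <- function(score, x, y) {
  rank(score, ties.method = "first", na.last = TRUE) <= 10
}

fit <- sbf(
  Class ~ .,
  data = simdata,
  method = "svmLinear",
  # Passed through to train(): fit a single model with C = 1, no inner tuning.
  tuneGrid = data.frame(C = 1),
  trControl = trainControl(method = "none"),
  sbfControl = sbfControl(
    functions = mySBF,
    method = "repeatedcv",
    number = 4,
    repeats = 10
  )
)

fit
```

Changing the 10 in the filter to 100 would keep the top 100 predictors, as asked in the question. Because the filter runs inside every resample, the selected predictors can differ from fold to fold, which is what keeps the performance estimate honest.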
Max