Solved – How to perform genetic-algorithm variable selection in R for SVM input variables

genetic algorithms, machine learning, r, svm

I'm using the kernlab package in R to build an SVM for classifying some data.

The SVM is working nicely in that it provides 'predictions' of decent accuracy; however, my list of input variables is larger than I would like, and I am unsure of the relative importance of the different variables.

I'd like to implement a genetic algorithm to select the subset of input variables that produces the best-trained/fittest SVM.

I'd like some help with choosing which R package to use when attempting this GA implementation (and possibly a brief pseudo-example).

I've looked at most of the R GA/GP packages out there (RGP, genalg, subselect, GALGO), but I'm struggling conceptually to see how I would pass my ksvm call in as part of the fitness function and feed my variable array in as the population pool.
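
To make the conceptual problem concrete, the kind of wiring I have in mind is roughly the following (a sketch only, assuming genalg's rbga.bin and kernlab's ksvm; myData is a placeholder data frame whose first column is the target and is named "targets"):

library(genalg)
library(kernlab)

## fitness: one binary chromosome = one candidate subset of the input variables
fitness <- function(chrom) {
  selected <- which(chrom == 1)
  if (length(selected) == 0) return(100)      # penalise empty subsets
  subset <- myData[, c(1, selected + 1)]      # column 1 holds the target
  fit <- ksvm(targets ~ ., data = subset, kernel = "rbfdot", cross = 5)
  cross(fit)                                  # GA minimises this CV error
}

ga <- rbga.bin(size = ncol(myData) - 1, evalFunc = fitness,
               popSize = 50, iters = 20, mutationChance = 0.05)

Is that the right way to connect the two?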

Any help, thoughts, or nudges in the right direction gratefully received.

Thanks

Code that solves this was added below in a later EDIT.

# Prediction function to be used for backtesting
pred1pd = function(t) {
  print(t)

  ## GA section: select the best variable set from those available
  # Evaluation function - scores a chromosome by the training error of the
  # SVM fitted on the variables it selects (lower is better)
  mi.evaluate <- function(string = c()) {
    tmp <- data[(t - lookback):t, -1]
    x <- string
    tmp <- tmp[, x == 1, drop = FALSE]
    tmp <- cbind(data[(t - lookback):t, 1], tmp)
    colnames(tmp)[1] <- "targets"
    trainedmodel <- ksvm(targets ~ ., data = tmp, type = ktype,
                         kernel = "rbfdot", kpar = list(sigma = 0.1),
                         C = C, prob.model = FALSE, cross = crossvalid)
    result <- error(trainedmodel)
    print(result)
    return(result)
  }

  ## Monitor the GA process
  monitor <- function(obj) {
    minEval <- min(obj$evaluations)
    plot(obj, type = "hist")
  }

  ## Run the GA; size is set to the number of potential indicators
  gaResults <- rbga.bin(size = 39, mutationChance = 0.10, zeroToOneRatio = 10,
                        evalFunc = mi.evaluate, verbose = TRUE,
                        monitorFunc = monitor, popSize = 50, iters = 3,
                        elitism = 10)

  ## Pull out the best chromosome (the one with the lowest evaluation) and
  ## rebuild the data frame from it so that we can train the final model
  bestChro <- gaResults$population[which.min(gaResults$evaluations), ]
  newData <- data[, -1]
  newData <- newData[, bestChro == 1, drop = FALSE]
  newData <- cbind(data[, 1], newData)
  colnames(newData)[1] <- "targets"
  print(colnames(newData))

  # Train the model on the reduced variable set
  model <- trainSVM(newData[(t - lookback):t, ], ktype, C, crossvalid)
  # Predict the next observation
  pred <- as.numeric(as.vector(predict(model, newData[t + 1, -1], type = "response")))
  # Print for user inspection
  print(pred)
  pred
}
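
For reference, pred1pd() relies on objects defined elsewhere in my script (data, lookback, ktype, C, crossvalid and the trainSVM() helper). A backtest then just maps it over the time index, along the lines of the following (hypothetical usage; the exact loop depends on your backtesting setup):

preds <- sapply(seq(lookback + 1, nrow(data) - 1), pred1pd)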

Best Answer

My advice would be not to do this. The theoretical advantages of the SVM that avoid over-fitting apply only to the determination of the Lagrange multipliers (the parameters of the model). As soon as you start performing feature selection, those advantages are essentially lost: there is little theory that covers model selection or feature selection, and you are highly likely to over-fit the feature-selection criterion, especially if you search really hard using a GA. If feature selection is important, I would use something like the LASSO, LARS or the elastic net, where the feature selection arises via regularisation. There the feature selection is more constrained, so there are fewer effective degrees of freedom and less over-fitting.
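
For example, a minimal sketch of that route using the glmnet package (glmnet, the toy x/y data and the lambda.min choice are my assumptions here, not part of the recommendation above): the variables with non-zero coefficients at the chosen penalty are the "selected" features.

library(glmnet)

## toy stand-in for the real problem: 39 candidate inputs, two-class target
set.seed(1)
x <- matrix(rnorm(200 * 39), ncol = 39)
y <- factor(rbinom(200, 1, plogis(x[, 1] - x[, 2])))

## alpha = 1 gives the LASSO; 0 < alpha < 1 gives the elastic net
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

## variables with non-zero coefficients at the cross-validated penalty
coefs <- as.matrix(coef(cvfit, s = "lambda.min"))
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
print(selected)

The selection here is driven by the regularisation penalty rather than by a search, which is what keeps the effective degrees of freedom down.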

Note that a key advantage of the SVM is that it is an approximate implementation of a generalisation bound that is independent of the dimensionality of the feature space. This suggests that feature selection shouldn't necessarily be expected to improve performance, and if there is a deficiency in the selection process (e.g. over-fitting the selection criterion) it may well make things worse!