Solved – Testing variable importance in prediction

machine learning, random forest, svm

I'm currently looking at patient data, trying to work out which variables are most important for predicting response to a drug. I don't have many patients but do have a lot of variables, which I know is less than ideal.

I've fitted an SVM ensemble and a random forest model containing all the variables. I have a second dataset with which I'm trying to test the validity of my models and gauge roughly how important each variable is for prediction. Is it valid to just set all of the values corresponding to a particular variable to 0, or should I shuffle the values within the column?
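To make the shuffling idea concrete, this is roughly what I have in mind, a minimal sketch of permutation importance on the held-out set (fit, valid, and response are placeholders, not my actual objects):

# Sketch of permutation importance on a held-out set.
# `fit` is a fitted model with a predict() method, `valid` the validation
# data frame, and `response` the name of the response column (placeholders).
permutation.importance <- function(fit, valid, response) {
  baseline <- mean(predict(fit, valid) == valid[[response]])  # baseline accuracy
  predictors <- setdiff(names(valid), response)
  sapply(predictors, function(v) {
    shuffled <- valid
    shuffled[[v]] <- sample(shuffled[[v]])  # break the variable-response link
    baseline - mean(predict(fit, shuffled) == valid[[response]])  # accuracy drop
  })
}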

Best Answer

I have used the randomForest package in R several times, and it provides functions such as importance() and varImpPlot() to measure variable importance. As far as I know, varImpPlot() visualizes each predictor's importance in terms of its contribution to the decrease in an error measure (e.g. mean squared error for regression, the Gini index for classification).

What I usually do to measure variable importance in a really simple way is to fit both a linear regression and a lasso regression, and then see how much each coefficient was shrunk.

library(MASS)
library(randomForest)
library(glmnet)
data(Boston)

# Random forest (based on the lab example from the book "An Introduction
# to Statistical Learning"; mtry = 13 uses all 13 predictors, i.e. bagging)
rf.boston <- randomForest(medv ~ ., data = Boston, mtry = 13, ntree = 25, importance = TRUE)
importance(rf.boston)
            %IncMSE IncNodePurity
 crim     2.80848132    1340.74773
 zn       2.12135233      34.74243
 indus    1.49676063     270.12398
 chas     0.06577971      31.42226
 nox      7.42985606    1381.58615
 rm      13.41323143   18128.73241
 age      6.28896854     487.95644
 dis      7.08361676    2621.61526
 rad      1.71445398     128.88846
 tax      6.48150760     557.31305
 ptratio  5.24860362     660.97934
 black    2.00139088     562.16876
 lstat    9.26159315   16553.01919

varImpPlot(rf.boston)

[Plot: varImpPlot(rf.boston), dotcharts of %IncMSE and IncNodePurity for each predictor]

# Linear model
lm.boston <- lm(medv ~ ., data = Boston)

# Lasso: choose lambda by cross-validation (1-SE rule), then refit
optim.lambda <- cv.glmnet(x = as.matrix(Boston[, -14]), y = as.vector(Boston[, 14]))$lambda.1se

lasso.boston <- glmnet(x = as.matrix(Boston[, -14]), y = as.vector(Boston[, 14]),
                       lambda = optim.lambda)


# Ratio of |lasso| to |OLS| coefficient for each predictor
# (near 0 = heavily shrunk, near 1 = barely shrunk by the lasso)
shrink.ratio <- abs(coef(lasso.boston)[-1]) / abs(coef(lm.boston)[-1])
shrink.ratio <- shrink.ratio[order(shrink.ratio)]
barplot(shrink.ratio, horiz = TRUE, col = "red", las = 2)

[Plot: horizontal barplot of the |lasso| / |OLS| coefficient ratios per predictor]

# Side-by-side comparison; with importance = FALSE only node purity
# importance is available, so varImpPlot() draws a single panel
par(mfrow = c(2, 1))
rf.boston <- randomForest(medv ~ ., data = Boston, mtry = 13, ntree = 25, importance = FALSE)
varImpPlot(rf.boston, main = "Variable importance (Random forest)")
barplot(shrink.ratio, horiz = TRUE, col = "red", las = 2, main = "Variable importance (Lasso)")

And a quick comparison of the two: [Plot: random forest importance on top, lasso shrinkage ratios below]

As far as I understand, the lasso approach can be a bit problematic when predictors are correlated. You can see that the rm variable actually ends up with a larger lasso coefficient than its linear-regression one.
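If you want to check that directly, the two rm coefficients can be compared by name (using the lm.boston and lasso.boston objects fitted above):

coef(lm.boston)["rm"]       # OLS coefficient for rm
coef(lasso.boston)["rm", ]  # lasso coefficient for rm at optim.lambda
# With correlated predictors the lasso can concentrate weight on one of
# them, so individual coefficients should be read with some caution.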