I'm working with patient data and trying to estimate how important each variable is for predicting response to a drug. I don't have many patients but I do have a lot of variables, which I know is less than ideal.
I've built an SVM ensemble and a random forest model, each containing all the variables. I have a second dataset with which I'm trying to test the validity of my models and, roughly, how important each variable is for prediction. Is it valid to just set all of the values of a particular variable to 0, or should I shuffle the values within the column?
Best Answer
I have used the randomForest package in R several times, and it has functions for measuring variable importance, such as importance() and varImpPlot(). As far as I know, varImpPlot() visualizes the importance of each predictor in terms of its contribution to the decrease in an error or impurity measure (e.g. mean squared error for regression, Gini index for classification).
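To make the shuffling idea from the question concrete, here is a minimal sketch of permuting one column at a time on a held-out set and recording how much the error grows. It is written in plain Python (standard library only) rather than R, and it is not a reproduction of what importance() does internally; `predict`, `X_test` and `y_test` are hypothetical stand-ins for an already fitted model and the second dataset.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only; every other column is left untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted model: it uses columns 0 and 1
# and ignores column 2 entirely.
def predict(row):
    return 3.0 * row[0] + 0.5 * row[1]


# Synthetic "second dataset" (an assumption, not real patient data).
data_rng = random.Random(1)
X_test = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y_test = [3.0 * r[0] + 0.5 * r[1] + data_rng.gauss(0, 0.1) for r in X_test]

importances = permutation_importance(predict, X_test, y_test)
for j, imp in enumerate(importances):
    print(f"variable {j}: mean increase in MSE = {imp:.3f}")
```

Shuffling keeps each column's observed distribution while breaking its link to the response. Setting a column to 0 instead feeds the model a constant that may lie outside the range it was trained on, so the resulting change in error is harder to interpret.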
What I usually do to measure variable importance in a really simple way is to fit a linear regression and a lasso regression, and then see how much the coefficients were shrunk.
And a quick comparison for both:
As far as I understand, the lasso approach can be a bit problematic when predictors are correlated. You can see that the rm variable has a larger lasso coefficient than the linear one.
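The linear-versus-lasso comparison described above can be sketched on synthetic data with two correlated predictors. This is an illustrative coordinate-descent implementation in plain Python, not the code behind the comparison in this answer and not the API of any R package; the data, the penalty `lam = 0.1`, and the helper names `soft_threshold` and `coordinate_descent` are all assumptions made for the example.

```python
import random


def soft_threshold(value, lam):
    """Shrink value towards zero by lam, returning 0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def coordinate_descent(X, y, lam, n_iter=300):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 without an intercept.

    With lam = 0 the penalty vanishes and this is ordinary least squares.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            # Covariance of column j with the residual, after adding
            # column j's own current contribution back in.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            z = sum(c * c for c in col) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            residual = [r - c * delta for c, r in zip(col, residual)]
            beta[j] = new_beta
    return beta


# Synthetic data (an assumption): x1 is strongly correlated with x0,
# x2 is pure noise, and the variables are generated around zero so that
# no intercept is needed.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    x0 = rng.gauss(0, 1)
    x1 = 0.9 * x0 + 0.436 * rng.gauss(0, 1)
    x2 = rng.gauss(0, 1)
    X.append([x0, x1, x2])
    y.append(1.0 * x0 + 1.0 * x1 + rng.gauss(0, 0.5))

ols_coef = coordinate_descent(X, y, lam=0.0)
lasso_coef = coordinate_descent(X, y, lam=0.1)

for j, (b_ols, b_lasso) in enumerate(zip(ols_coef, lasso_coef)):
    print(f"x{j}: linear = {b_ols:.3f}, lasso = {b_lasso:.3f}")
```

The lasso always shrinks the total absolute size of the coefficients, but with correlated predictors it may redistribute weight between them, so an individual lasso coefficient is not guaranteed to be smaller than its linear counterpart.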