Solved – Interpret Variable Importance (varImp) for Factor Variables

Tags: caret, r, random forest

When I run variable importance on a random forest (or any other model), the names of factor/categorical variables appear with the factor level appended as a suffix. For example, for the model

SALARY ~ STATE + CITY + AGE + …, the result of varImp(model) could look like this:

> varImp(model)
rf variable importance

only 20 most important variables shown (out of 1050)

                      Importance
AGE                       100.00
STATECA                    91.84
STATEAZ                    86.24
CITYSTANFORD               74.15
STATEVT                    71.27

In terms of relative importance, would it be right to interpret this as AGE is the most important predictor, followed by STATE followed by CITY?

The importance values also say nothing about the relationship between the predictor and the outcome. For example, does higher age equate to higher salary, does STATE CA mean higher salary, and so on? Any suggestions on how such measures can be obtained for "black box" models such as random forest, gbm, etc. would be very helpful.

Best Answer

The random forest variable importance scores are aggregate measures. They quantify only how much a predictor contributes to the model's predictions, not the direction or shape of its effect on the outcome.

To see the shape of an effect, you could hold the other predictors at fixed values, or average over their observed values, and get a profile of predicted values over a single predictor (see partialPlot in the randomForest package). Otherwise, fit a parametric model where you can estimate specific structural terms.
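As a rough illustration of the profiling idea, the sketch below fits a forest to simulated data and calls partialPlot with plot = FALSE so the profile comes back as numbers instead of a figure. The data frame, the effect sizes, and the object names (dat, rf, pd_age, pd_state) are invented for this example and are not from the question. Note that partialPlot averages the predictions over the observed values of the other predictors rather than pinning them to one value.

```r
library(randomForest)

set.seed(42)
n <- 300

# Simulated data (an assumption for illustration only)
dat <- data.frame(
  STATE = factor(sample(c("AZ", "CA", "VT"), n, replace = TRUE)),
  AGE   = round(runif(n, 20, 65))
)
dat$SALARY <- 30000 + 800 * dat$AGE +
  ifelse(dat$STATE == "CA", 15000, 0) + rnorm(n, sd = 5000)

predictors <- dat[, c("STATE", "AGE")]

# Non-formula interface: factors stay as factors
rf <- randomForest(x = predictors, y = dat$SALARY)

# Average prediction as AGE varies across a grid of values
pd_age <- partialPlot(rf, pred.data = predictors, x.var = "AGE", plot = FALSE)

# Average prediction for each level of STATE
pd_state <- partialPlot(rf, pred.data = predictors, x.var = "STATE", plot = FALSE)

# Each result is a list with grid values in $x and averaged predictions in $y
print(data.frame(AGE = pd_age$x, avg_pred = pd_age$y))
print(data.frame(STATE = pd_state$x, avg_pred = pd_state$y))
```

If the averaged prediction rises with AGE, or is higher for one STATE level than the others, that suggests the direction of the effect, which the importance score alone does not show.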

Your other question is about the effect of CITY and STATE. You may have used the formula interface when creating the model (i.e. function(y ~ ., data = dat)). In that case, the formula interface may be expanding each factor into dummy variables (as it should), so every level gets its own importance score and there is no single score for STATE or CITY as a whole. You might try the non-formula interface instead, where x holds the predictors (in factor form), y is the outcome, and the call is function(x, y). That avoids dummy variable creation and treats each factor predictor as a cohesive set.
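The difference between the two interfaces can be sketched on simulated data. Here model.matrix stands in for the dummy expansion that a formula method would perform (an assumption about what happens internally in your model call), and the data, effect sizes, and object names are made up for illustration.

```r
library(randomForest)

set.seed(1)
n <- 300

# Simulated data (an assumption for illustration only)
dat <- data.frame(
  STATE = factor(sample(c("AZ", "CA", "VT"), n, replace = TRUE)),
  AGE   = round(runif(n, 20, 65))
)
dat$SALARY <- 30000 + 800 * dat$AGE +
  ifelse(dat$STATE == "CA", 15000, 0) + rnorm(n, sd = 5000)

# Formula-style expansion: one 0/1 column per non-reference factor level
X_dummy <- model.matrix(SALARY ~ STATE + AGE, data = dat)[, -1]  # drop intercept
print(colnames(X_dummy))

# Forest on the dummy columns: importance is reported per level
rf_dummy <- randomForest(x = X_dummy, y = dat$SALARY, importance = TRUE)
print(rownames(importance(rf_dummy)))

# Forest on the raw factor: importance is reported once for STATE
rf_factor <- randomForest(x = dat[, c("STATE", "AGE")], y = dat$SALARY,
                          importance = TRUE)
print(rownames(importance(rf_factor)))
```

With the dummy columns, the importance table has rows such as STATECA and STATEVT; with the factor passed directly, it has one row for STATE, which is the form needed to compare STATE against AGE as whole predictors.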
