Solved – Random Forest – Numeric and Dummy Variables together

logistic, many-categories, predictive-models, r, random forest

I am trying to build a logistic regression model and a random forest model on the same data to predict the probability of default. For the logistic regression model, I created dummy variables from the categorical variables, so its final input has 9 dummy variables and 2 numeric variables, age and level (age takes values from 18 to 60, level from 4 to 10). I want to use the same input dataset for the random forest model. When I did so using the "randomForest" package, I got the following variable importance plot.

[Image: variable importance plot]

Level seems to be a very good variable by both MSE and node purity. Level is also a very important variable in the logistic regression (p-value ~ 10^-5).
However, age is very important by node purity but not by MSE. In the logistic regression, age is also not a very good variable, with a p-value of 0.026. So I want to understand: does being numeric increase the node purity importance of a variable through overfitting? Is it unsuitable to use numeric and dummy variables together in a random forest model? Or is there something I am missing?

I had similar doubts about using numeric and dummy variables in logistic regression, but in logistic regression it did not create any problem.

Best Answer

RF is one of the most robust techniques for handling a combination of data types, yet it can mishandle data when some predictors have very few categories (particularly if they are unbalanced) and others have many. Several things to explore: What is the total amount of variation explained? (If it is very small, the discrepancy is not surprising, and it will also tell you whether the variable is indeed 'very good'.) Are any of the categories unbalanced? Are the numerical predictors strongly correlated? Have the RF settings been optimized? PS: What do you mean by "in logistic regression, age is not a very good variable with p value of 0.026"?
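
On the specific question of whether being numeric can inflate node purity importance: impurity-based importance tends to favor predictors that offer more candidate split points, because the best of many thresholds will look better than a single threshold even when the predictor is pure noise. The following is a minimal sketch of that effect in plain Python. It is not the randomForest package's implementation. The function names, sample sizes, and the Gini criterion are all assumptions chosen for illustration. It compares the best single-node impurity decrease for a noise variable with the same range as age (18 to 60) against a noise 0/1 dummy.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_gain(x, y):
    """Largest impurity decrease over all thresholds on x at a single node."""
    n = len(y)
    parent = gini(y)
    best = 0.0
    # Candidate thresholds are the distinct values of x, except the largest.
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def simulate(n_rows=200, n_reps=200, seed=0):
    """Average best-split gain for two predictors unrelated to the outcome."""
    rng = random.Random(seed)
    total_numeric = 0.0
    total_dummy = 0.0
    for _ in range(n_reps):
        y = [rng.randint(0, 1) for _ in range(n_rows)]
        age = [rng.randint(18, 60) for _ in range(n_rows)]  # many split points
        dummy = [rng.randint(0, 1) for _ in range(n_rows)]  # one split point
        total_numeric += best_split_gain(age, y)
        total_dummy += best_split_gain(dummy, y)
    return total_numeric / n_reps, total_dummy / n_reps


mean_age_gain, mean_dummy_gain = simulate()
print(f"mean best gain, noise 'age' (18-60): {mean_age_gain:.4f}")
print(f"mean best gain, noise dummy (0/1):   {mean_dummy_gain:.4f}")
```

Neither predictor carries any signal here, yet the numeric one should show the larger average gain simply because it has more thresholds to try. Permutation-based importance (the MSE measure in randomForest) is computed on out-of-bag data and is less prone to this effect, which may explain why age ranks high on node purity but not on MSE.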