Solved – Random Forest Variable Importance in R and Python

machine learning, python, r, random forest, scikit-learn

I am new to random forests and have tried to implement them in both R and Python, following the exact steps of a paper that compared the two. However, when I plot the variable importance, the Gini index scaling differs between the two implementations, and the variables ranked as important seem to differ as well. Can someone please explain why the scaling is different, and which software I should base my results on?

If my main concern is variable importance, should I run the random forest on my whole data set, or is it still wise to split my data into a train/test set and retrieve the important variables from the training set? I tried fitting the random forest on the whole data set to see whether that would give me the same results, and I still get different results between R and Python.

R code

library(randomForest)

rf <- randomForest(Cell ~ ., data = val, importance = TRUE)

R plot

Python code

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, oob_score=True, criterion='gini')
bc_rf = rnd_clf.fit(dat1, bc_Y)

Python plot

Best Answer

Unfortunately, the random forest implementations in R and Python are not always directly comparable. Because the random forest algorithm inherently performs bagging and random selection of explanatory variables (i.e. it samples both the rows and the columns of the training set when growing each tree), any difference in how this resampling is done can lead to somewhat different results. Transferring random seeds between R and Python is not straightforward, so the two runs cannot be made to use identical bootstrap samples.
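While seeds cannot be shared across the two languages, each run can at least be made reproducible within its own implementation. Below is a minimal sketch for the scikit-learn side, assuming dat1 and bc_Y are the feature matrix and labels from the question; on the R side the analogous step would be calling set.seed() before randomForest().

from sklearn.ensemble import RandomForestClassifier

# Fixing random_state makes the bootstrap samples and the per-split feature
# sub-sampling reproducible across runs of this script (not across languages).
rnd_clf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                 criterion='gini', random_state=42)
bc_rf = rnd_clf.fit(dat1, bc_Y)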

I think that neither implementation is "better". If I were forced to choose between the two in this case, I would simply pick the one with the smaller cross-validation (CV) error; that way I could argue that a fair experiment, irrespective of implementation choice, was carried out and that the better-performing model was picked (a sketch of such a CV estimate for the Python side is given below). An excellent overview of different cross-validation procedures can be found in Arlot & Celisse (2010), A survey of cross-validation procedures for model selection; I would suggest reading it carefully.
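For example, a rough way to obtain a CV error on the Python side is a k-fold estimate as sketched below. This is only a sketch, assuming dat1 and bc_Y from the question; the R model would need to be scored with an equivalent resampling scheme (e.g. via caret or manually constructed folds) for the comparison to be fair.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 10-fold CV accuracy of the scikit-learn forest; the R model would need an
# equivalent resampling estimate before the two can be compared fairly.
rnd_clf = RandomForestClassifier(n_estimators=500, criterion='gini', random_state=42)
cv_acc = cross_val_score(rnd_clf, dat1, bc_Y, cv=10, scoring='accuracy')
print(cv_acc.mean(), cv_acc.std())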

Finally, yes: the whole data set should be used when training the final model, so that we extract the maximum insight from the available data. Note that this concerns "using the model" in production (or, in this case, reporting the variable importances), rather than "comparing the model" against another one (where CV or other resampling is more relevant). In case you are not aware of it, you might find the R package randomForestExplainer useful; it offers some excellent visualisation functions for random forest variable importance. I strongly suspect similar packages exist for Python, but I have not come across them personally.
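On the Python side, the analogous workflow after refitting on the full data set would be to extract the Gini-based importances directly from the fitted estimator. The sketch below assumes dat1 is a pandas DataFrame so that column names are available for labelling.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Refit on the full data set, then rank variables by mean decrease in impurity.
rnd_clf = RandomForestClassifier(n_estimators=500, criterion='gini', random_state=42)
rnd_clf.fit(dat1, bc_Y)
importances = pd.Series(rnd_clf.feature_importances_, index=dat1.columns)
print(importances.sort_values(ascending=False))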
