Solved – random forest variable importance with continuous and categorical variables and unbalanced output

importance, r, random forest

I am a bit lost in the literature regarding the random forest importance.
I am aware that there are different methods.

I have a binary output variable where the elements labelled 0 are much more numerous than the ones labelled 1.

Five variables are used as input. Some of them are continuous and others are categorical.

Given a dataset of this type, I am wondering what the best method is to assess variable importance with a random forest, and whether it is available in any R or Python library.

I know that the standard approach based on the Gini impurity index is not suitable for this case due to the presence of continuous and categorical input variables.

Best Answer

Random forests for classification can use two kinds of variable importance. See the original description of the RF here.

"I know that the standard approach based on the Gini impurity index is not suitable for this case due to the presence of continuous and categorical input variables"

This is plain wrong. The Gini impurity is built using only the proportions of the target (dependent) variable in the nodes produced by a split test, and that test may involve either a numerical or a nominal independent variable. Note that the independent variable plays a role only in building the split test; the computation of the Gini index is based only on counts of the dependent variable after the split. Of course, the Gini impurity index at each node is then used to compute the Gini importance.
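To make that point concrete, here is a minimal sketch in plain Python. The toy data and the names `gini_impurity`, `impurity_decrease`, `age` and `colour` are made up for illustration. The impurity decrease takes only the target labels and a left/right assignment, so a threshold test on a continuous variable and a subset test on a categorical variable are scored in exactly the same way.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2, computed from class counts alone."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(labels, goes_left):
    """Impurity decrease of a binary split.

    `goes_left` is a list of booleans produced by any split test;
    the feature values themselves are never used here.
    """
    left = [y for y, g in zip(labels, goes_left) if g]
    right = [y for y, g in zip(labels, goes_left) if not g]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Toy data, invented for illustration only.
y = [0, 0, 0, 0, 0, 0, 1, 1]
age = [23, 35, 41, 29, 52, 38, 61, 58]                # continuous input
colour = ["red", "blue", "red", "blue",
          "red", "blue", "green", "green"]            # categorical input

numeric_split = [a <= 55 for a in age]                # threshold test
nominal_split = [c in {"red", "blue"} for c in colour]  # subset test

dec_numeric = impurity_decrease(y, numeric_split)
dec_nominal = impurity_decrease(y, nominal_split)
print(dec_numeric, dec_nominal)
```

Both tests happen to induce the same partition of the rows here, so they receive the same impurity decrease, regardless of the type of the variable that produced the split.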

I do not know if this counts, but in my personal experiments there were no big differences between Gini variable importance and permutation variable importance, and I have usually preferred the former.
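If you want to check this on your own data, the two importances can be compared side by side. The sketch below assumes scikit-learn and uses a synthetic, unbalanced dataset from `make_classification`. The sample size, class weights and number of trees are arbitrary choices of mine, not values from the question. Note that scikit-learn's forest expects numeric inputs, so categorical variables would have to be encoded first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, unbalanced data: 5 inputs, roughly 90% zeros (assumed values).
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Gini (mean decrease in impurity) importance, computed during training.
gini_importance = forest.feature_importances_

# Permutation importance, computed here on the held-out rows.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
perm_importance = perm.importances_mean

# Compare the rankings (most important feature first).
gini_rank = np.argsort(gini_importance)[::-1]
perm_rank = np.argsort(perm_importance)[::-1]
print("Gini ranking:       ", gini_rank)
print("Permutation ranking:", perm_rank)
```

Whether the two rankings agree on a given dataset is exactly what such a comparison is meant to reveal, so no particular outcome is claimed here.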

The second problem is the imbalance between the samples labeled 1 and 0. I think this might play a role in variable importance, but to be honest I would verify whether this is the case. I would therefore repeat the variable importance computation many times on samples with different class proportions, varying gradually from a 0.5 ratio to the actual ratio. I would not be surprised to find that the variable importance stays stable regardless of the proportion.
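A rough sketch of the subsampling part of that experiment, using only the standard library, could look like this. The helper name `subsample_to_fraction`, the toy labels and the number of steps are my own assumptions. It keeps all rows labeled 1 and randomly drops rows labeled 0 until the requested share of ones is reached.

```python
import random


def subsample_to_fraction(labels, minority_fraction, seed=0):
    """Return sorted row indices of a subsample in which class 1 makes up
    approximately `minority_fraction` of the rows.

    All rows labeled 1 are kept; rows labeled 0 are randomly dropped.
    Assumes class 1 is the minority class.
    """
    rng = random.Random(seed)
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    # Zeros needed so that ones / (ones + zeros) is about minority_fraction.
    n_zeros = round(len(ones) * (1 - minority_fraction) / minority_fraction)
    n_zeros = min(n_zeros, len(zeros))
    return sorted(ones + rng.sample(zeros, n_zeros))


# Toy labels with 10% ones, invented for illustration only.
labels = [1] * 50 + [0] * 450
actual = sum(labels) / len(labels)

# Move gradually from a balanced sample to the actual proportion.
steps = 5
fractions = [0.5 + (actual - 0.5) * k / (steps - 1) for k in range(steps)]

for k, f in enumerate(fractions):
    idx = subsample_to_fraction(labels, f, seed=k)
    share = sum(labels[i] for i in idx) / len(idx)
    print(f"target {f:.2f}: {len(idx)} rows, share of ones {share:.2f}")
    # Here one would refit the forest on rows `idx` and store its importances.
```

Repeating each step with several seeds and comparing the stored importances across fractions would show whether the ranking moves with the class proportion.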

[later edit]

It took me some time to compile the document provided by @Donbeo. I agree with the results from that paper and I hope to experiment further with them myself. The only thing I do not like about that study is that it does not state how many trees were used, or what effect varying this parameter would have. The only note on this is that the number of trees affects the scaled version of the permutation tests.
