MATLAB: Feature Selection in TreeBagger

Tags: classification, splits, treebagger

Hello MathWorks community,
I'm currently working with the TreeBagger class to generate classification tree ensembles, and I would like to know how it decides which features are used for splitting the data. For example, I create an ensemble of 5000 tree stumps, use it to classify a dataset with two features (e.g. VRQL-Value and maximum frequency), and then check which feature was selected for the split in every single tree like this:
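nTrees = numel(Random_Forest_Model.Trees);
cellArray = cell(1, nTrees);   % preallocate one entry per tree
for y = 1:nTrees
    % name of the predictor used for the root split of tree y
    cellArray{y} = Random_Forest_Model.Trees{y}.CutPredictor{1};
end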
In some cases, only one feature is selected for all 5000 trees and the other feature is never selected (i.e. cellArray looks like this: {'x2', 'x2', 'x2', …, 'x2'}). This also happens with more than two features: only one feature is selected and the others are ignored.
Some things about the dataset that may be important:
- One feature takes values from 1 to 100, the other from about 200 to 1200
- The classes are imbalanced (class 1: 52 entries, class 2: over 300 entries)
- Only the larger class contains NaNs
- Both features contain NaNs
My question is: how can I get TreeBagger to use all features for classification rather than only one, or, more generally, how can I achieve a more balanced selection of features?

Best Answer

By default, TreeBagger samples ceil(sqrt(p)) features at each decision split for classification, where p is the total number of features.
Why this number specifically? I don't know...
But why is it important to take a subset of the features rather than the whole set? If every tree always considers the same features (say, the whole set), the trees end up highly correlated, and averaging them cannot cancel out their inherently high variance.
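To make the variance argument concrete, here is the standard result for averaging identically distributed predictors. It is not from the original answer; the symbols B (number of trees), \sigma^2 (variance of a single tree's prediction) and \rho (pairwise correlation between trees) are introduced here only for illustration.

```latex
% Variance of the average of B identically distributed tree predictions T_b,
% each with variance \sigma^2 and pairwise correlation \rho (assumed equal for all pairs).
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As B grows, the second term vanishes but the first does not, so the remaining variance is governed by how correlated the trees are. Sampling a subset of features at each split is one way to lower \rho.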
I believe the features are sampled uniformly, which means that with many trees, all features should be represented approximately equally across the ensemble.
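As a rough illustration of what uniform sampling would imply, the following sketch simulates the draw directly. It does not call TreeBagger; it only assumes, as the paragraph above does, that candidate features are drawn uniformly without replacement. The values of p, m and nTrees are hypothetical.

```matlab
rng(1);                        % fixed seed, only for reproducibility
p = 2;                         % hypothetical: total number of features
m = 1;                         % hypothetical: features sampled per split
nTrees = 5000;                 % one root split per stump

counts = zeros(1, p);
for t = 1:nTrees
    idx = randperm(p, m);      % uniform draw of m distinct features out of p
    counts(idx) = counts(idx) + 1;
end

share = counts / nTrees;       % each entry should be close to m/p
disp(share)
```

With m smaller than p, every feature is offered as a candidate in roughly the same fraction of splits, which is the balance the question is asking for.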
However, in your case the sampled subset has the same size as the original feature set (ceil(sqrt(2)) = 2). Once the candidate features are selected, a split criterion determines which feature the split is based on. The criterion can be the Gini index or information gain (entropy).
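The arithmetic is easy to check for a few feature counts. The values of p below are arbitrary examples, not taken from the question, apart from p = 2.

```matlab
p = [2 4 9 10 100];        % example feature counts (only p = 2 is from the question)
m = ceil(sqrt(p));         % number of features sampled under the ceil(sqrt(p)) rule
disp([p; m])               % first row: p, second row: m
sampledAll = (m == p);     % true where the "subset" is the whole feature set
```

Among these examples, only p = 2 gives m equal to p, so with two features the random subset is no subset at all.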
So my guess is that, since you always end up with the whole set of features and the same criterion is used every time to choose among them, you always end up with the same feature, and the other one is excluded.
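If that guess is right, one way to test it would be to sample fewer features than are available. Below is a minimal sketch on synthetic data, assuming the Statistics and Machine Learning Toolbox and a release where the option is named 'NumPredictorsToSample' (older releases call it 'NVarToSample') and where TreeBagger forwards 'MaxNumSplits' to the tree learner. The data, the label rule and the tree count are made up for illustration and are not the asker's dataset.

```matlab
rng(1);                                        % reproducibility only
n = 350;
X = [100*rand(n,1), 200 + 1000*rand(n,1)];     % synthetic stand-in for the two features
score = X(:,1) + 0.1*X(:,2) + 20*randn(n,1);   % hypothetical rule tying labels to both features
Y = double(score > 120) + 1;                   % synthetic class labels 1 and 2

mdl = TreeBagger(200, X, Y, ...
    'Method', 'classification', ...
    'NumPredictorsToSample', 1, ...            % offer only 1 of the 2 features at each split
    'MaxNumSplits', 1);                        % stumps: at most one split per tree

% Tally which predictor each tree used for its root split
rootPred = cellfun(@(t) t.CutPredictor{1}, mdl.Trees, 'UniformOutput', false);
[names, ~, k] = unique(rootPred);
counts = accumarray(k(:), 1);
for i = 1:numel(names)
    fprintf('%s: %d trees\n', names{i}, counts(i));
end
```

With only one candidate per split, the split criterion no longer gets to pick the same winner every time, so both x1 and x2 should show up in the tally. The cost is that some stumps will split on the weaker feature.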