Hello MathWorks community
I'm currently working with the TreeBagger class to generate some classification tree esembles. Now I would like to know, how it decides wich features are used for splitting the data. If I create for example an esemble of tree stumps with 5000 trees and use it to classify a dataset with two features (e.g. VRQL-Value and maximum frequency), and then check which feature was selected for splitting for every single tree like this:
cellArray={};for y=1:length(Random_Forest_Model.Trees)cellArray{y}=Random_Forest_Model.Trees{y}.CutPredictor{1};end
It happens in some cases, that only one feature was selected for all 5000 trees and the other feature was selected in not a single case (i.e. cellArray looks like this: {'x2', 'x2', 'x2', …, 'x2', }). This can also happen with multiple features: only one feature is selected, the others are ignored.
Maybe important things to mention about the dataset:
-One feature achieves Values from 1 to 100, the other one from about 200 to 1200
-The classes are imbalanced (class 1: 52 entries, class 2: over 300 entries)
-only the greater class contains the NaNs
-both features contain NaNs
My question now is: how can I achieve, that the TreeBagger uses all features for classification and not only one or how can I in genreal achieve a more balanced selection of features.
Best Answer