MATLAB: TreeBagger gives different results depending on ‘oobvarimp’ being ‘on’ or ‘off’

Tags: classification, confusion matrix, ilya narsky, oobvarimp, Statistics and Machine Learning Toolbox, treebagger

Turning the oobvarimp option 'on' or 'off' is only supposed to change whether a measure of variable importance is computed. It should not change the classification itself.
However, I have recently noticed that it also produces a different classification. Below are my code and the resulting confusion matrices.
First, I run TreeBagger twice with exactly the same data and options, except for the oobvarimp setting ('on'/'off').
Here is the 'off' version:
RandStream.setDefaultStream(RandStream('mlfg6331_64','seed',27));
model2roff = TreeBagger(400, Xr1, Y1, 'Method', 'classification', 'oobpred', 'on', 'oobvarimp', 'off', 'nprint', 100, 'MinLeaf', 1, 'prior', 'equal', 'cost', cost, 'categorical', find(iscatr));
Here is the 'on' version:
RandStream.setDefaultStream(RandStream('mlfg6331_64','seed',27));
model2ron = TreeBagger(400, Xr1, Y1, 'Method', 'classification', 'oobpred', 'on', 'oobvarimp', 'on', 'nprint', 100, 'MinLeaf', 1, 'prior', 'equal', 'cost', cost, 'categorical', find(iscatr));
I then compute the confusion matrices using the following code, first for model2ron and then for model2roff. In theory, the results should be identical: the same TreeBagger ensemble should have been grown with both the 'off' and 'on' options, and the only difference should be that the model stores an additional measure of variable importance. That should not affect classification performance (with identical data, variables, etc.).
[pred_model2r_oobY1, pred_model2r_oobY1scores] = oobPredict(model2ron); % then repeat with model2roff
[conf, classorder] = confusionmat(Y1, pred_model2r_oobY1);
disp(dataset({conf, classorder{:}}, 'obsnames', classorder));
So, here are the results:
First, with oobvarimp 'off':

                 pos_outcome   neg_outcome
    pos_outcome          104            21
    neg_outcome           23            62

Next, with oobvarimp 'on':

                 pos_outcome   neg_outcome
    pos_outcome           99            26
    neg_outcome           30            55
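For a quick sense of the size of the discrepancy, the overall accuracies implied by the two matrices can be computed directly (the variable names below are mine, not part of the model):

```matlab
% Confusion matrices copied from the output above (rows = true class).
conf_off = [104 21; 23 62];   % oobvarimp 'off'
conf_on  = [ 99 26; 30 55];   % oobvarimp 'on'

% Overall accuracy = sum of the diagonal / total observation count.
acc = @(C) sum(diag(C)) / sum(C(:));
fprintf('off: %.3f   on: %.3f\n', acc(conf_off), acc(conf_on));
% off: 0.790   on: 0.733
```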
You can see that there is a significant change (even a small one would be problematic, since the forests should be identical).
Has anyone else observed this? Does anyone (Ilya Narsky) have an explanation?

Best Answer

Computing variable importance by permuting out-of-bag observations across every variable (which is what happens when you set oobvarimp to 'on') requires additional draws from the random number generator. That is why the results are not identical.
oobvarimp does not change the classification in a meaningful way; what you observe are statistical fluctuations.
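Here is a minimal sketch of that mechanism (not TreeBagger's actual internals): two runs seed the same generator identically, but the second run makes a few extra draws in between, analogous to the permutations performed when 'oobvarimp' is 'on', so the later draws no longer match.

```matlab
% Run 1: seed the stream and draw some numbers.
s = RandStream('mlfg6331_64', 'Seed', 27);
RandStream.setGlobalStream(s);        % setDefaultStream on older releases
a = rand(1, 3);                       % draws that would grow the trees

% Run 2: same seed, but extra draws consumed in between.
s = RandStream('mlfg6331_64', 'Seed', 27);
RandStream.setGlobalStream(s);
extra = rand(1, 5);                   % extra draws (the permutation step)
b = rand(1, 3);                       % same call as above, different values

isequal(a, b)                         % returns 0: the runs have diverged
```

So both forests are legitimate draws from the same distribution of random forests; they are simply not the same draw.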