Solved – getting lower recall with Boosted Tree than Decision Tree on unbalanced data

Tags: boosting, cart, machine-learning, unbalanced-classes

I am using the following unbalanced dataset: 90% of the points are blue and 10% are red.

[Figure: scatter plot of the generated data – roughly 90% blue points and 10% red points]

This can be generated with the make_data and plot_data functions in the code below.

I used three classifiers: Decision Tree, Random Forest, and Boosted Trees. I trained on 80% of the data and tested on the remaining 20%.

My understanding is that an unbalanced dataset will bias a single decision tree: the classifier will be biased towards predicting "blue" because it has seen more of it in the training set. Therefore, recall on "red" should suffer, since the biased classifier will tend to predict points that are truly "red" as "blue".

My understanding is that a Boosted Tree classifier should be able to mitigate this bias, as label weights are constantly adjusted during the training process. Therefore, I expected the Boosted Tree classifier to have higher recall than the Decision Tree classifier, and Random Forest to be somewhere in between. But my results are the opposite:

Decision Tree:
recall score: 0.19975786723
precision score: 0.196744634046
Random Forest:
recall score: 0.150153848517
precision score: 0.276472254469
Boosted Tree:
recall score: 0.0572728994607
precision score: 0.473005591932

What am I missing?

Here is my entire code:

import numpy as np
import pandas as pd
import pylab
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import recall_score, precision_score

def make_data(n_class_a, n_class_b, noise):
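    # class_a is labelled 1 (plotted blue), class_b is labelled 0 (plotted red)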
    class_a = np.random.normal(25, noise, (n_class_a, 2))
    class_b = np.random.normal(75, noise, (n_class_b, 2))

    features = np.vstack((class_a, class_b))
    labels = np.hstack((np.ones(n_class_a), np.zeros(n_class_b)))

    random_inds = np.arange(len(labels))
    np.random.shuffle(random_inds)
    features = features[random_inds]
    labels = labels[random_inds]

    return features, labels


def plot_data(features, labels):

    colors = ["blue" if x == 1 else "red" for x in labels]

    pylab.scatter(features[:,0], features[:,1], c=colors)
    pylab.show()

def print_model_results(model, n_class_a, n_class_b, noise, n_iterations=10, pos_label=0):
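    # pos_label=0: recall/precision are reported for the minority ("red") class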

    recall_sum = 0.0
    precision_sum = 0.0

    N = int((n_class_a + n_class_b) * 0.8)

    for i in range(n_iterations):

        features, labels = make_data(n_class_a=n_class_a, 
                                     n_class_b=n_class_b,
                                     noise=noise)

        model.fit(features[:N], labels[:N])
        result = model.predict(features[N:])
        recall_sum +=  recall_score(y_true=labels[N:], y_pred=result, pos_label=pos_label, average="binary")
        precision_sum += precision_score(labels[N:], result, pos_label=pos_label, average="binary")

    print "recall score: {}".format(recall_sum / n_iterations)
    print "precision score: {}".format(precision_sum / n_iterations)




if __name__ == "__main__":


    features, labels = make_data(n_class_a=9000, n_class_b=1000, noise=70)
    plot_data(features, labels)

    dt = DecisionTreeClassifier()
    rf = RandomForestClassifier()
    gb = GradientBoostingClassifier()

    print "Decision Tree:"
    print_model_results(dt, n_class_a=9000, n_class_b=1000, noise=70, n_iterations=50)
    print "Radnom Forest:"
    print_model_results(rf, n_class_a=9000, n_class_b=1000, noise=70, n_iterations=50)
    print "Boosted Tree:"
    print_model_results(gb, n_class_a=9000, n_class_b=1000, noise=70, n_iterations=50)

Edit:

I played around with changing n_estimators for the Boosted Trees. If I increase n_estimators, my bias goes down, but so does my variance.
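Concretely, the only change is how the booster is constructed; a sketch, reusing print_model_results from the script above:

gb = GradientBoostingClassifier(n_estimators=3000)  # default is 100
print "Boosted Tree (n_estimators=3000):"
print_model_results(gb, n_class_a=9000, n_class_b=1000, noise=70, n_iterations=50)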

Increasing n_estimators to, say, n_estimators=3000 makes my results look more like the other two classifiers, for example:

Decision Tree:
recall score: 0.175879396985
precision score: 0.169902912621
Random Forest:
recall score: 0.142156862745
precision score: 0.295918367347
Boosted Tree:
recall score: 0.111111111111
precision score: 0.315789473684

Best Answer

Why are you iterating 50 times and adding up all of the scores? I think this is causing your problem. Every time I run this I get different results. You should just create the model once for each algorithm and then compare results. This should give you more accurate results.
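For example, a single-run comparison can look like this (a sketch reusing make_data and the classifier imports from your script; the stratified train_test_split is an addition so the 90/10 class ratio is preserved in the test set):

from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score

features, labels = make_data(n_class_a=9000, n_class_b=1000, noise=70)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels)

for name, model in [("Decision Tree", DecisionTreeClassifier()),
                    ("Random Forest", RandomForestClassifier()),
                    ("Boosted Tree", GradientBoostingClassifier())]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print name
    print "  class 0 recall:", recall_score(y_test, pred, pos_label=0)
    print "  class 0 precision:", precision_score(y_test, pred, pos_label=0)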

Also, this statement is not true unless you set parameters for it:

"as label weights are constantly adjusted during the training process"

Records from both classes will have a weight of 1 unless you specify otherwise. Here is the parameter that you need to set to get the weighting that you want: sample_weight

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.fit
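For example, to up-weight the minority class when fitting (continuing from the split above; the 9-to-1 weights are simply inverse class frequencies, chosen for illustration):

import numpy as np

# class 0 ("red") is 10% of the data, class 1 ("blue") is 90%,
# so weighting class 0 by 9 gives both classes equal total weight
weights = np.where(y_train == 0, 9.0, 1.0)

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train, sample_weight=weights)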

I'm not sure what the default metric is for the GradientBoostingClassifier, but the metric is most likely 'better' when it focuses on the majority class, which results in a higher recall for class 1 but a lower recall for class 0.
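Per-class recall and precision like the numbers below can be pulled out with sklearn's confusion_matrix and classification_report, for example (again continuing from the split above, with an unweighted model):

from sklearn.metrics import confusion_matrix, classification_report

gb = GradientBoostingClassifier().fit(X_train, y_train)
pred = gb.predict(X_test)
print confusion_matrix(y_test, pred)        # rows: true class, columns: predicted class
print classification_report(y_test, pred)   # precision/recall for class 0 and class 1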

Here are my results from running the model once for each algorithm:

Decision Tree:
TP 1640
FP 145
TN 38
FN 177

CLASS 0 - RECALL: 0.20765027322404372
CLASS 0 - PRECISION 0.17674418604651163

CLASS 1 - RECALL: 0.9025866813428729
CLASS 1 - PRECISION 0.9187675070028011

recall score: 0.20765027322404372
precision score: 0.17674418604651163
Random Forest:
TP 1708
FP 167
TN 36
FN 89

CLASS 0 - RECALL: 0.17733990147783252
CLASS 0 - PRECISION 0.288

CLASS 1 - RECALL: 0.9504730105731776
CLASS 1 - PRECISION 0.9109333333333334

recall score: 0.17733990147783252
precision score: 0.288
Boosted Tree:
TP 1785
FP 183
TN 18
FN 14

CLASS 0 - RECALL: 0.08955223880597014
CLASS 0 - PRECISION 0.5625

CLASS 1 - RECALL: 0.9922178988326849
CLASS 1 - PRECISION 0.9070121951219512

recall score: 0.08955223880597014
precision score: 0.5625  

So as I said above, you can see that the gradient boosted tree is focusing on the majority class and not the minority class, which results in the lower recall for class 0.

Let me know if you need more clarification on anything from above.