I am using an unbalanced dataset in which 90% of the points are blue and 10% are red. It can be generated and plotted with the make_data and plot_data functions from the code below.
I used three classifiers: Decision Tree, Random Forest and Boosted Trees. I trained on 80% of the data and tested on the remaining 20%.
My understanding is that an unbalanced dataset will bias a single decision tree: the classifier will be biased towards predicting "blue" because it has seen more of it in the training set. Therefore, recall on "red" values should be negatively impacted, since the biased classifier should predict some points that are truly "red" as "blue".
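To make that concern concrete, here is a minimal sketch of the extreme case: a classifier that always predicts "blue". The labels are hypothetical (not produced by make_data) and only mirror the 90/10 split, with 1 = "blue" and 0 = "red" as in the code below.

```python
import numpy as np

# Hypothetical labels mirroring the 90/10 split: 1 = "blue", 0 = "red".
labels = np.concatenate([np.ones(900), np.zeros(100)])

# A degenerate classifier that always predicts the majority class.
predictions = np.ones_like(labels)

# Fraction of all points labelled correctly.
accuracy = np.mean(predictions == labels)

# Recall on red = correctly predicted red points / all truly red points.
red_mask = labels == 0
red_recall = np.sum(predictions[red_mask] == 0) / red_mask.sum()

print("accuracy:", accuracy)
print("recall on red:", red_recall)
```

Such a classifier is right on 90% of the points while never finding a single red one, which is why I look at recall on "red" rather than accuracy.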
My understanding is also that a Boosted Tree classifier should be able to mitigate the bias, as label weights are constantly adjusted during the training process. Therefore, I expected the Boosted Tree classifier to have higher recall than the Decision Tree classifier, with Random Forest somewhere in between. But my results are the opposite:
Decision Tree:
recall score: 0.19975786723
precision score: 0.196744634046
Random Forest:
recall score: 0.150153848517
precision score: 0.276472254469
Boosted Tree:
recall score: 0.0572728994607
precision score: 0.473005591932
What am I missing?
Here is my entire code:
import numpy as np
import pylab
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score, precision_score

def make_data(n_class_a, n_class_b, noise):
    # Two overlapping Gaussian blobs: class a (label 1) and class b (label 0).
    class_a = np.random.normal(25, noise, (n_class_a, 2))
    class_b = np.random.normal(75, noise, (n_class_b, 2))
    features = np.vstack((class_a, class_b))
    labels = np.hstack((np.ones(n_class_a), np.zeros(n_class_b)))
    # Shuffle features and labels with the same permutation.
    random_inds = np.arange(len(labels))
    np.random.shuffle(random_inds)
    features = features[random_inds]
    labels = labels[random_inds]
    return features, labels

def plot_data(features, labels):
    colors = ["blue" if x == 1 else "red" for x in labels]
    pylab.scatter(features[:, 0], features[:, 1], c=colors)
    pylab.show()

def print_model_results(model, n_class_a, n_class_b, noise, n_iterations=10, pos_label=0):
    # Average recall and precision for pos_label over freshly generated datasets.
    recall_sum = 0.0
    precision_sum = 0.0
    N = int((n_class_a + n_class_b) * 0.8)  # first 80% for training
    for i in range(n_iterations):
        features, labels = make_data(n_class_a=n_class_a,
                                     n_class_b=n_class_b,
                                     noise=noise)
        model.fit(features[:N], labels[:N])
        result = model.predict(features[N:])
        recall_sum += recall_score(y_true=labels[N:], y_pred=result, pos_label=pos_label, average="binary")
        precision_sum += precision_score(labels[N:], result, pos_label=pos_label, average="binary")
    print("recall score: {}".format(recall_sum / n_iterations))
    print("precision score: {}".format(precision_sum / n_iterations))

if __name__ == "__main__":
    features, labels = make_data(n_class_a=9000, n_class_b=1000, noise=70)
    plot_data(features, labels)
    dt = DecisionTreeClassifier()
    rf = RandomForestClassifier()
    gb = GradientBoostingClassifier()
    print("Decision Tree:")
    print_model_results(dt, n_class_a=9000, n_class_b=1000, noise=70, n_iterations=50)
    print("Random Forest:")
    print_model_results(rf, n_class_a=9000, n_class_b=1000, noise=70, n_iterations=50)
    print("Boosted Tree:")
    print_model_results(gb, n_class_a=9000, n_class_b=1000, noise=70, n_iterations=50)
Edit:
I played around with changing n_estimators for Boosted Trees. If I increase n_estimators, my bias will go down, but so will my variance. Increasing it to, say, n_estimators=3000 makes my results look more like those of the other two classifiers, for example:
Decision Tree:
recall score: 0.175879396985
precision score: 0.169902912621
Random Forest:
recall score: 0.142156862745
precision score: 0.295918367347
Boosted Tree:
recall score: 0.111111111111
precision score: 0.315789473684
Best Answer
Why are you iterating through the model results 50 times and adding together all of the sums? I think this is causing your problem. Every time I run this I get different results. You should just create the model 1 time for each algorithm and then compare results. This should give you more accurate results.
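A rough sketch of what a single, repeatable run per algorithm could look like. The fixed seed, the stratified train_test_split and the smaller sample size are my own assumptions, not part of your code; evaluate_once is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_once(seed=0, n_class_a=1800, n_class_b=200, noise=70):
    """Fit each model once on one seeded split; return {name: (recall, precision)} for class 0."""
    rng = np.random.RandomState(seed)
    # Same construction as make_data, but seeded (and smaller, an assumption for speed).
    features = np.vstack((rng.normal(25, noise, (n_class_a, 2)),
                          rng.normal(75, noise, (n_class_b, 2))))
    labels = np.hstack((np.ones(n_class_a), np.zeros(n_class_b)))
    # Stratify so the 90/10 ratio is preserved in both the train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=seed)

    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Boosted Tree": GradientBoostingClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        predicted = model.fit(X_train, y_train).predict(X_test)
        scores[name] = (
            recall_score(y_test, predicted, pos_label=0),
            # zero_division=0 avoids a warning if a model never predicts class 0.
            precision_score(y_test, predicted, pos_label=0, zero_division=0),
        )
    return scores


if __name__ == "__main__":
    for name, (recall, precision) in evaluate_once().items():
        print("{}: recall={:.3f} precision={:.3f}".format(name, recall, precision))
```

With the seed fixed, rerunning the script should reproduce the same numbers, so differences between the three algorithms are easier to compare.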
Also, this statement is not true unless you set parameters for it: "label weights are constantly adjusted during the training process".
Records from both classes have a weight of 1 unless you specify otherwise. The parameter you need to change to get the weighting you want is sample_weight, which is passed to fit:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.fit
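As a hedged sketch of how sample_weight could be used here: the "balanced" weighting via compute_sample_weight is just one possible choice (my assumption, not something from your code), and the data is a smaller, seeded version of what make_data produces.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.RandomState(0)
# Smaller, seeded stand-in for make_data (assumed sizes: 1800 blue, 200 red).
n_a, n_b, noise = 1800, 200, 70
features = np.vstack((rng.normal(25, noise, (n_a, 2)),
                      rng.normal(75, noise, (n_b, 2))))
labels = np.hstack((np.ones(n_a), np.zeros(n_b)))
order = rng.permutation(len(labels))
features, labels = features[order], labels[order]

n_train = int(0.8 * len(labels))
X_train, y_train = features[:n_train], labels[:n_train]
X_test, y_test = features[n_train:], labels[n_train:]

# Default behaviour: every record has weight 1.
unweighted = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# "balanced" gives each record a weight inversely proportional to its class frequency.
weights = compute_sample_weight(class_weight="balanced", y=y_train)
weighted = GradientBoostingClassifier(random_state=0).fit(
    X_train, y_train, sample_weight=weights)

recall_unweighted = recall_score(y_test, unweighted.predict(X_test), pos_label=0)
recall_weighted = recall_score(y_test, weighted.predict(X_test), pos_label=0)
print("recall on class 0, unweighted:", recall_unweighted)
print("recall on class 0, weighted:  ", recall_weighted)
```

I would expect the weighted model to recover more of class 0, usually at the cost of precision, but the exact numbers depend on the data and the seed.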
I'm not sure what the default metric is for GradientBoostingClassifier, but it is most likely 'better' when the model focuses on the majority class, which results in a higher recall for class 1 but a lower recall for class 0.
Here are my results from running the model once for each algorithm:
So, as I said above, you can see that the gradient boosted tree is focusing on the majority and not the minority, which results in the lower recall for class 0.
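If it helps, per-class recall can be read off directly with average=None. The labels and predictions below are made up purely for illustration (they are not my actual results); they only mimic a model that rarely predicts class 0.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical test set: 90 records of class 1 (majority), 10 of class 0 (minority).
y_true = np.array([1] * 90 + [0] * 10)
# Hypothetical predictions: 2 majority records called class 0,
# and only 1 of the 10 minority records found.
y_pred = np.array([1] * 88 + [0] * 2 + [1] * 9 + [0] * 1)

# average=None returns one score per class, in the order given by labels.
recall_per_class = recall_score(y_true, y_pred, labels=[0, 1], average=None)
precision_per_class = precision_score(y_true, y_pred, labels=[0, 1], average=None)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

print("recall    [class 0, class 1]:", recall_per_class)
print("precision [class 0, class 1]:", precision_per_class)
print(matrix)
```

In this toy case recall is 0.1 for class 0 and about 0.98 for class 1, which is the same pattern of high majority recall and low minority recall described above.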
Let me know if you need more clarification on anything from above.