Solved – Gradient boosting – extreme predictions vs predictions close to 0.5

boosting, cart, classification

Let's say you train two different Gradient Boosting Classifier models on two different datasets. You use leave-one-out cross-validation, and you plot the histograms of predictions that the two models output. The histograms look like this:
[Histogram 1: out-of-sample predictions concentrated near 0 and 1]

and this:

[Histogram 2: out-of-sample predictions concentrated near 0.5]

So, in one case, predictions (on out-of-sample / validation sets) are mostly extreme (close to 0 and 1), and in the other case predictions are close to 0.5.

What, if anything, can be inferred from each graph? How could one explain the difference? Can anything be said about the dataset/features/model?

My gut feeling is that in the first case, the features explain the data better so the model gets a better fit to the data (and possibly overfits it, but not necessarily – the performance on the validation/test sets could still be good if the features actually explain the data well). In the second case, the features do not explain the data well and so the model does not fit too closely to the data. The performance of the two models could still be the same in terms of precision and recall, however. Would that be correct?
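
To make the last point concrete, here is a minimal sketch (with made-up prediction vectors, not the output of any real model) showing that two prediction distributions, one extreme and one hedged around 0.5, can give identical precision and recall at a 0.5 threshold while differing sharply under a proper scoring rule such as log loss:

import numpy as np
from sklearn.metrics import precision_score, recall_score, log_loss

# Made-up probabilities: both "models" classify every point correctly at 0.5,
# but one is confident and the other hedges.
y_true = np.array([1, 1, 1, 0, 0, 0])
p_extreme = np.array([0.97, 0.95, 0.93, 0.05, 0.04, 0.08])
p_hedged = np.array([0.58, 0.56, 0.55, 0.45, 0.44, 0.47])

for name, p in [("extreme", p_extreme), ("hedged", p_hedged)]:
    y_pred = (p >= 0.5).astype(int)
    print(name,
          "precision:", precision_score(y_true, y_pred),
          "recall:", recall_score(y_true, y_pred),
          "log loss:", round(log_loss(y_true, p), 3))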

Best Answer

I have prepared a short script to show what I think should be the right intuition.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.model_selection import train_test_split


def create_dataset(location, scale, N):
    # Two Gaussian blobs: class 0 centred at (location, location) and class 1 at
    # (-location, -location); the larger |location| is, the more separable they are.
    class_zero = pd.DataFrame({
        'x': np.random.normal(location, scale, size=N),
        'y': np.random.normal(location, scale, size=N),
        'C': [0.0] * N
    })

    class_one = pd.DataFrame({
        'x': np.random.normal(-location, scale, size=N),
        'y': np.random.normal(-location, scale, size=N),
        'C': [1.0] * N
    })
    # pd.concat replaces DataFrame.append, which has been removed from pandas
    return pd.concat([class_one, class_zero], ignore_index=True)

def predictions(values):
    # Hold out half of the data, fit a GBM regressor on the 0/1 labels of the
    # training half, and return its raw predictions on the held-out half.
    X_train, X_test, tgt_train, tgt_test = train_test_split(
        values[["x", "y"]], values["C"], test_size=0.5, random_state=9)
    clf = ensemble.GradientBoostingRegressor()
    clf.fit(X_train, tgt_train)
    y_hat = clf.predict(X_test)
    return y_hat

N = 10000
scale = 1.0
locations = [0.0, 1.0, 1.5, 2.0]

f, axarr = plt.subplots(2, len(locations))
for i in range(0, len(locations)):
    print(i)
    values = create_dataset(locations[i], scale, N)

    axarr[0, i].set_title("location: " + str(locations[i]))

    d = values[values.C==0]
    axarr[0, i].scatter(d.x, d.y, c="#0000FF", alpha=0.7, edgecolor="none")
    d = values[values.C==1]
    axarr[0, i].scatter(d.x, d.y, c="#00FF00", alpha=0.7, edgecolor="none")

    y_hats = predictions(values)
    axarr[1, i].hist(y_hats, bins=50)
    axarr[1, i].set_xlim((0, 1))

plt.show()

What the script does:

  • it creates different scenarios in which the two classes are progressively more separable - I could give a more formal definition of separability here, but the intuition should be clear from the scatter plots
  • it fits a GBM regressor on the training half of each dataset and outputs the predicted values obtained by feeding the test X values to the trained model

The produced chart shows what the generated data looks like in each scenario and the distribution of the predicted values. The interpretation: lack of separability translates into the predicted $y$ being at or right around 0.5.
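
The same picture emerges if you prefer to fit an actual classifier and histogram its predicted probabilities instead of regressing on the 0/1 labels. A minimal variant of the predictions helper (hypothetical name predicted_probabilities, same split) could be:

from sklearn import ensemble
from sklearn.model_selection import train_test_split

def predicted_probabilities(values):
    # Same 50/50 split as above, but fit a classifier and return the
    # predicted probability of class 1 instead of a regression output.
    X_train, X_test, tgt_train, tgt_test = train_test_split(
        values[["x", "y"]], values["C"], test_size=0.5, random_state=9)
    clf = ensemble.GradientBoostingClassifier()
    clf.fit(X_train, tgt_train)
    return clf.predict_proba(X_test)[:, 1]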

All this illustrates the intuition; it should not be hard to prove it in a more formal fashion, although I would start from a logistic regression - that would make the math definitely easier.

[Figure 1: scatter plots of the four scenarios (top row) and histograms of the corresponding test-set predictions (bottom row)]
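
As for the logistic regression route: with fully overlapping classes the fitted coefficients shrink towards zero, so the predicted probability $\frac{1}{1 + e^{-(w^\top x + b)}}$ collapses to the class prior, roughly 0.5 for balanced classes. A minimal sketch of this, reusing the create_dataset helper above (the locations 0.0 and 2.0 are just illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Reusing create_dataset from the script above: location=0.0 gives fully
# overlapping classes, location=2.0 gives well-separated ones.
for location in [0.0, 2.0]:
    values = create_dataset(location, 1.0, 10000)
    lr = LogisticRegression().fit(values[["x", "y"]], values["C"])
    p = lr.predict_proba(values[["x", "y"]])[:, 1]
    print("location:", location,
          "coefficients:", np.round(lr.coef_, 2),
          "mean |p - 0.5|:", round(float(np.mean(np.abs(p - 0.5))), 3))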


EDIT 1

I am guessing in the leftmost example, where the two classes are not separable, if you set the parameters of the model to overfit the data (e.g. deep trees, large number of trees and features, relatively high learning rate), you would still get the model to predict extreme outcomes, right? In other words, the distribution of predictions is indicative of how closely the model ended up fitting the data?

Let's assume that we have a super deep decision tree. In this scenario, we would see the distribution of prediction values peak at 0 and 1, and we would also see a low training error. We can make the training error arbitrarily small: we could have that deep tree overfit to the point where each leaf of the tree corresponds to one datapoint in the training set, and each datapoint in the training set corresponds to a leaf in the tree. The clear sign of overfitting would be the poor performance on the test set of a model that is very accurate on the training set. Note that in my chart I present the predictions on the test set; they are much more informative.
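
To check this directly, here is a sketch that deliberately overfits on the non-separable case (location = 0) and compares training and test predictions; the hyperparameters below are just one overfitting-prone choice, not a recommendation, and the code reuses create_dataset from the script above.

import numpy as np
from sklearn import ensemble
from sklearn.model_selection import train_test_split

# Deliberately overfitting hyperparameters on non-separable data (location = 0).
values = create_dataset(0.0, 1.0, 10000)
X_train, X_test, tgt_train, tgt_test = train_test_split(
    values[["x", "y"]], values["C"], test_size=0.5, random_state=9)

overfit = ensemble.GradientBoostingRegressor(
    max_depth=10, n_estimators=300, learning_rate=0.5)
overfit.fit(X_train, tgt_train)

for name, X, tgt in [("train", X_train, tgt_train), ("test", X_test, tgt_test)]:
    y_hat = overfit.predict(X)
    print(name,
          "MSE:", round(float(np.mean((y_hat - tgt) ** 2)), 3),
          "share of predictions outside [0.1, 0.9]:",
          round(float(np.mean((y_hat < 0.1) | (y_hat > 0.9))), 3))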

One additional note: let's work with the leftmost example. Let's train the model on all class A datapoints in the top half of the circle and on all class B datapoints in the bottom half of the circle. We would have a model that is very accurate on its training data, with a distribution of prediction values peaking at 0 and 1. The predictions on the test set (all class A points in the bottom half of the circle, and all class B points in the top half) would also peak at 0 and 1 - but they would be entirely incorrect. This is a nasty "adversarial" training strategy. Nevertheless, in summary: the distribution of predictions sheds light on the degree of separability, but it is not really what matters.
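
For completeness, a sketch of that "adversarial" split, again reusing create_dataset and taking class A to be C == 0 and class B to be C == 1: train only on class 0 points in the top half of the plane and class 1 points in the bottom half, then predict on the remaining points.

import numpy as np
from sklearn import ensemble

# Non-separable data, but each class is only seen on one side of y = 0 in training.
values = create_dataset(0.0, 1.0, 10000)
train = values[((values.C == 0) & (values.y > 0)) | ((values.C == 1) & (values.y <= 0))]
test = values.drop(train.index)

clf = ensemble.GradientBoostingRegressor()
clf.fit(train[["x", "y"]], train["C"])
y_hat = clf.predict(test[["x", "y"]])

# Confident predictions (far from 0.5), yet almost all on the wrong side.
print("share of predictions outside [0.1, 0.9]:",
      round(float(np.mean((y_hat < 0.1) | (y_hat > 0.9))), 3))
print("accuracy at a 0.5 threshold:",
      round(float(np.mean((y_hat >= 0.5) == (test["C"] == 1).to_numpy())), 3))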