Solved – Why is cross_val_score substantially lower than .score or roc_auc_score

auc, cross-validation, python, roc, scikit-learn

I have a trained model, a GradientBoostingClassifier. My dataset is roughly 60,000 rows, which I've split 66/33 into train/test sets. Scoring the model via the .score() method or via sklearn.metrics.roc_auc_score() returns quite reasonable scores:

In: gbc.score(x_test, y_test)
Out: 0.8958226221079691
In: roc_auc_score(y_test, gbc.predict(x_test))
Out: 0.8899345768861056

That ain't so bad. However, when I use cross_val_score I'm getting a substantially lower value:

In: scores = cross_val_score(gbc, df, target, cv=10, scoring='roc_auc')
In: scores.mean()
Out: 0.5646406271571536

The documentation for cross_val_score says that by default it uses the .score method of the estimator you're using, and that passing a value to the scoring parameter changes this. I've tried both, but I'm somehow getting results that are wildly different from both the default .score method and from the roc_auc_score function that I assume cross_val_score uses when I pass scoring='roc_auc'.
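For reference, here is my understanding of how those two options resolve, sketched with sklearn's get_scorer and reusing the gbc, x_test and y_test from above (the comments are my expectations, not verified output):

from sklearn.metrics import get_scorer

# Default scoring: cross_val_score falls back to the estimator's own .score(),
# which for a classifier like GradientBoostingClassifier is plain accuracy.
acc_scorer = get_scorer('accuracy')
auc_scorer = get_scorer('roc_auc')   # what scoring='roc_auc' resolves to

# Scorers are called as scorer(estimator, X, y); the 'roc_auc' scorer asks the
# estimator for predict_proba()/decision_function() rather than hard predictions.
print(acc_scorer(gbc, x_test, y_test))   # should match gbc.score(x_test, y_test)
print(auc_scorer(gbc, x_test, y_test))   # AUC computed from predicted probabilities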

I could understand why this might be the case if I had used GridSearchCV to tune the model's hyperparameters; in that case I would assume I'd overfit to the test data. However:

1) I haven't done any tuning at all; it's literally using the default parameters.
2) I tried to test this by running the following:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
import numpy as np

val_scores = []
test_scores = []

for _ in range(10):
    # hold out 33% of the data as a test set
    x_train, x_test, y_train, y_test = train_test_split(
        df,
        target,
        test_size=0.33
    )

    # carve a validation set out of the remaining training rows
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.5)

    gbc = GradientBoostingClassifier()
    gbc.fit(x_train, y_train)
    test_scores.append(gbc.score(x_test, y_test))
    val_scores.append(gbc.score(x_val, y_val))

# plain lists have no .mean(), so take the means with numpy
np.mean(test_scores)
np.mean(val_scores)

And the scores on both the test and validation slices are north of 0.89. There's no random_state in play here, so I would expect train_test_split to be shuffling randomly enough that any accidental overfitting due to the model's default parameters ought to be negated.

So; why might cross_val_score be reporting scores significantly lower than either .score or roc_auc_score?

Best Answer

I eventually found the answer in this thread:

https://stackoverflow.com/questions/43688058/sklearn-cross-val-score-gives-lower-accuracy-than-manual-cross-validation

which is that the splitter cross_val_score uses to build the folds (in my case, StratifiedKFold) doesn't, by default, select a random subset of the data; it just takes consecutive blocks of rows, the "top n rows" or something similar. As my data is sorted by date and then by a particular feature, this means the model has to predict the dependent variable on later rows without necessarily having seen some of the values of those two features. The answer is to explicitly pass a StratifiedKFold with the shuffle parameter set to True (a quick way to verify the fold behaviour is sketched after the fix):

from sklearn.model_selection import StratifiedKFold, cross_val_score

s = cross_val_score(
    gbc,
    df,
    target,
    cv=StratifiedKFold(shuffle=True),
    scoring='roc_auc'
)
s.mean()
0.9595958053830161
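
To see what's going on, here's a rough sketch using made-up data as a stand-in for my date-sorted dataset, looking at which row indices end up in each test fold with and without shuffling:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a date-sorted dataset: rows are in time order and
# the class balance is roughly constant over time.
rng = np.random.default_rng(0)
y_sorted = rng.integers(0, 2, size=1000)       # hypothetical binary target
X_sorted = np.arange(1000).reshape(-1, 1)      # row position as a dummy feature

for shuffle in (False, True):
    skf = StratifiedKFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    spans = [(test.min(), test.max()) for _, test in skf.split(X_sorted, y_sorted)]
    print(f"shuffle={shuffle}: test-index ranges per fold -> {spans}")

Without shuffling, each fold's test rows come from one narrow, ordered slice of the dataset, which is exactly the problem when the rows are sorted by date; with shuffle=True the test rows span the whole index range.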

Note that the cross-validated score is now higher than my earlier roc_auc_score result because, I think, the 'roc_auc' scorer uses predict_proba() rather than the hard labels from predict(), which makes sense.
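
To illustrate that point (reusing the gbc, x_test and y_test from the question; this is just a sketch of the general effect, not output from my actual data):

from sklearn.metrics import roc_auc_score

# Hard 0/1 predictions throw away the ranking information that ROC AUC is
# built on, so scoring on predict() typically understates the model's AUC.
auc_from_labels = roc_auc_score(y_test, gbc.predict(x_test))
auc_from_probs = roc_auc_score(y_test, gbc.predict_proba(x_test)[:, 1])
print(auc_from_labels, auc_from_probs)   # the probability-based AUC is usually higher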