Solved – Why is cross_val_score substantially lower than .score or roc_auc_score

auc, cross-validation, python, roc, scikit-learn

I have a trained model, a GradientBoostingClassifier. My dataset is roughly 60,000 rows, which I've split 66/33 into train/test sets. Scoring the model via the .score() method or via sklearn.metrics.roc_auc_score() returns quite reasonable scores:

In: gbc.score(x_test, y_test)
Out: 0.8958226221079691
In: roc_auc_score(y_test, gbc.predict(x_test))
Out: 0.8899345768861056

That ain't so bad. However, when I use cross_val_score I'm getting a substantially lower value:

In: scores = cross_val_score(gbc, df, target, cv=10, scoring='roc_auc')
In: scores.mean()
Out: 0.5646406271571536

The documentation for cross_val_score says that by default it uses the .score method of the estimator you're using, and that passing a value to the scoring parameter changes this. I've tried both, but I'm somehow getting results that are wildly different from both the default .score method and from the roc_auc_score function that I assume cross_val_score uses when I pass scoring='roc_auc'.
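For reference, here is my understanding of how those two options resolve, sketched with sklearn's get_scorer and reusing the gbc, x_test and y_test from above (the comments are my expectations, not verified output):

from sklearn.metrics import get_scorer

# Default scoring: cross_val_score falls back to the estimator's own .score(),
# which for a classifier like GradientBoostingClassifier is plain accuracy.
acc_scorer = get_scorer('accuracy')
auc_scorer = get_scorer('roc_auc')   # what scoring='roc_auc' resolves to

# Scorers are called as scorer(estimator, X, y); the 'roc_auc' scorer asks the
# estimator for predict_proba()/decision_function() rather than hard predictions.
print(acc_scorer(gbc, x_test, y_test))   # should match gbc.score(x_test, y_test)
print(auc_scorer(gbc, x_test, y_test))   # AUC computed from predicted probabilities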

I could understand why this might be the case if I had used GridSearchCV to tune the model's hyperparameters; in that case I would assume I'd overfit to the test data. However:

1) I haven't done any tuning at all; it's literally using the default parameters.
2) I tried to test this by running the following:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
import numpy as np

val_scores = []
test_scores = []

for _ in range(10):
    # hold out 33% of the data as a test set
    x_train, x_test, y_train, y_test = train_test_split(
        df,
        target,
        test_size=0.33
    )

    # carve a validation set out of the remaining training rows
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.5)

    gbc = GradientBoostingClassifier()
    gbc.fit(x_train, y_train)
    test_scores.append(gbc.score(x_test, y_test))
    val_scores.append(gbc.score(x_val, y_val))

# plain lists have no .mean(), so take the means with numpy
np.mean(test_scores)
np.mean(val_scores)

And the scores on both the test and validation slices are north of 0.89. There's no random_state in play here, so I would expect train_test_split to be shuffling randomly enough that any accidental overfitting due to the model's default parameters ought to be negated.

So; why might cross_val_score be reporting scores significantly lower than either .score or roc_auc_score?

Best Answer

I eventually found the answer in this thread:

https://stackoverflow.com/questions/43688058/sklearn-cross-val-score-gives-lower-accuracy-than-manual-cross-validation

which is that the splitter cross_val_score uses to build the folds (in my case, StratifiedKFold) doesn't, by default, select a random subset of the data; it just takes consecutive blocks of rows, the "top n rows" or something similar. As my data is sorted by date and then by a particular feature, this means the model has to predict the dependent variable on later rows without necessarily having seen some of the values of those two features. The answer is to explicitly pass a StratifiedKFold with the shuffle parameter set to True (a quick way to verify the fold behaviour is sketched after the fix):

from sklearn.model_selection import StratifiedKFold, cross_val_score

s = cross_val_score(
    gbc,
    df,
    target,
    cv=StratifiedKFold(shuffle=True),
    scoring='roc_auc'
)
s.mean()
0.9595958053830161
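
To see what's going on, here's a rough sketch using made-up data as a stand-in for my date-sorted dataset, looking at which row indices end up in each test fold with and without shuffling:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a date-sorted dataset: rows are in time order and
# the class balance is roughly constant over time.
rng = np.random.default_rng(0)
y_sorted = rng.integers(0, 2, size=1000)       # hypothetical binary target
X_sorted = np.arange(1000).reshape(-1, 1)      # row position as a dummy feature

for shuffle in (False, True):
    skf = StratifiedKFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    spans = [(test.min(), test.max()) for _, test in skf.split(X_sorted, y_sorted)]
    print(f"shuffle={shuffle}: test-index ranges per fold -> {spans}")

Without shuffling, each fold's test rows come from one narrow, ordered slice of the dataset, which is exactly the problem when the rows are sorted by date; with shuffle=True the test rows span the whole index range.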

Note that the cross-validated score is now higher than my earlier roc_auc_score result because, I think, the 'roc_auc' scorer uses predict_proba() rather than the hard labels from predict(), which makes sense.
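
To illustrate that point (reusing the gbc, x_test and y_test from the question; this is just a sketch of the general effect, not output from my actual data):

from sklearn.metrics import roc_auc_score

# Hard 0/1 predictions throw away the ranking information that ROC AUC is
# built on, so scoring on predict() typically understates the model's AUC.
auc_from_labels = roc_auc_score(y_test, gbc.predict(x_test))
auc_from_probs = roc_auc_score(y_test, gbc.predict_proba(x_test)[:, 1])
print(auc_from_labels, auc_from_probs)   # the probability-based AUC is usually higher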