Solved – Why is a large choice of K lowering the cross validation score

cross-validation, machine learning, random forest, sample-size, scikit-learn

Playing around with the Boston Housing dataset and RandomForestRegressor (with default parameters) in scikit-learn, I noticed something odd: the mean cross-validation score decreased as I increased the number of folds beyond 10. My cross-validation strategy was as follows:

from sklearn.model_selection import ShuffleSplit, cross_val_score

cv_met = ShuffleSplit(n_splits=num_cvs, test_size=1 / num_cvs)
scores = cross_val_score(est, X, y, cv=cv_met)

… where num_cvs was varied. I set test_size to 1/num_cvs to mirror the train/test split sizes of k-fold CV. Basically, I wanted something like k-fold CV, but I also needed randomness (hence ShuffleSplit).

This trial was repeated several times, and avg scores and standard deviations were then plotted.

[Figure: area of circle ~ K in K-fold cross-validation]

(Note that the size of k is indicated by the area of the circle; standard deviation is on the Y axis.)

Consistently, increasing k (from 2 to 44) yielded a brief increase in score, followed by a steady decrease once k grew beyond ~10 folds. If anything, I would expect more training data to lead to a minor increase in score!
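Roughly, the experiment looked like the following sketch (here with the California Housing data as a stand-in, since the Boston Housing dataset has been removed from recent scikit-learn releases, and with an illustrative set of k values):

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = fetch_california_housing(return_X_y=True)
X, y = X[:506], y[:506]  # keep it small, roughly the size of the Boston dataset

est = RandomForestRegressor()

for num_cvs in (2, 5, 10, 20, 44):  # illustrative values of k
    cv_met = ShuffleSplit(n_splits=num_cvs, test_size=1 / num_cvs)
    scores = cross_val_score(est, X, y, cv=cv_met)  # default scoring is r^2
    print(f"k={num_cvs:2d}  mean={scores.mean():.3f}  std={scores.std():.3f}")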

Update

Changing the scoring metric to mean absolute error results in the behavior I'd expect: the score improves as the number of folds in K-fold CV increases, rather than approaching 0 (as it does with the default, 'r2'). The question remains why the default scoring metric gives poor performance in both the mean and the standard deviation of the scores as the number of folds increases.
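For example, a minimal version of the MAE setup (assuming est, X and y as in the sketch above; scikit-learn exposes MAE for cross_val_score as the negated 'neg_mean_absolute_error' scorer, so values closer to 0 are better):

# Same CV setup, scored with (negated) mean absolute error instead of r^2.
from sklearn.model_selection import ShuffleSplit, cross_val_score

num_cvs = 20  # illustrative value
cv_met = ShuffleSplit(n_splits=num_cvs, test_size=1 / num_cvs)
scores = cross_val_score(est, X, y, cv=cv_met,
                         scoring='neg_mean_absolute_error')
print(scores.mean(), scores.std())  # values closer to 0 are better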

Best Answer

r^2 score is undefined when applied to a single sample (e.g. leave-one-out CV), because its denominator, the sum of squared deviations of the true values from their mean, is zero for a single sample.

r^2 is also a poor metric for small test sets: when it is used to evaluate a sufficiently small test set, the score can be far into the negatives despite good predictions.

Given a single sample, a good prediction for a given domain may appear terrible:

from sklearn.metrics import r2_score
true = [1]
predicted = [1.01] # prediction of a single value, off by 1%
print(r2_score(true, predicted))
# 0.0 (recent scikit-learn versions instead warn that r^2 is not well-defined for a single sample and return nan)

Increase the size of the test set (keeping the accuracy of predictions the same), and suddenly the r^2 score appears near-perfect:

true = [1, 2, 3]
predicted = [1.01, 2.02, 3.03]
print(r2_score(true, predicted))
# 0.9993

Taken to the other extreme, if the test set contains only 2 samples, and those 2 samples happen to be close to each other by chance, this will have a substantial impact on the r^2 score, even if the predictions are quite good:

true = [20.2, 20.1] # actual target values from the Boston Housing dataset
predicted = [19, 21]
print(r2_score(true, predicted))
# -449.0
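To make the mechanism explicit, the same pair can be computed by hand from the definition r^2 = 1 - SS_res/SS_tot: with two nearly identical true values, SS_tot is tiny, so even small residuals drive the score far below zero.

import numpy as np

true = np.array([20.2, 20.1])
predicted = np.array([19, 21])

ss_res = np.sum((true - predicted) ** 2)    # 1.2**2 + 0.9**2 = 2.25
ss_tot = np.sum((true - true.mean()) ** 2)  # 0.05**2 + 0.05**2 = 0.005
print(1 - ss_res / ss_tot)                  # ~ -449, matching r2_score above

# With a single sample, ss_tot is exactly 0, which is why r^2 is undefined there.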