Solved – Large Negative r-Squared Scores using Cross-Validation

cross-validation · python · scikit-learn

I am working with the World Happiness Report dataset from Kaggle. When using either cross_val_score or GridSearchCV from sklearn, I get very large negative r2 scores. My first thought was that my models were severely over-fitting (it is a small dataset), but when I performed cross-validation by splitting the data manually with KFold, I got reasonable results.

You can view an example of what I am talking about in this Google Colab Notebook. The relevant code is also shown below.

Using cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
print(cross_val_score(model, X, y, scoring='r2', cv=5))

Output: [-5.57285067 -5.9477523 -6.23988074 -8.84930385 -2.39521998]

Using KFold

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

model = LinearRegression()
kf = KFold(n_splits=5, random_state=1, shuffle=True)
scores = []

for train_index, test_index in kf.split(X):
    X_train = X[train_index,:]
    y_train = y[train_index]
    X_test = X[test_index,:]
    y_test = y[test_index]

    model.fit(X_train, y_train)
    test_score = model.score(X_test, y_test)
    scores.append(round(test_score, 6))

print(scores)

Output: [0.829785, 0.774577, 0.762708, 0.661945, 0.727391]

Some Additional Observations

  • It doesn't seem to matter what type of model I use. I still get very large negative scores when using cross_val_score.
  • I created a synthetic dataset approximately the same size as the World Happiness dataset just to try some things out. In that case, I did not get large negative r2 scores from cross_val_score. This is shown in the Google Colab notebook shared above.
  • I notice that the magnitude of the negative results I get using cross_val_score is greatly affected by the number of folds I use. Increasing the number of folds significantly increases the magnitude.
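Worth noting when reading the last observation: for a regression target, an integer cv in cross_val_score expands to an unshuffled KFold, so cv=5 means five contiguous blocks of rows. A quick check, using make_regression only as a stand-in dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# make_regression is just a stand-in dataset for the check.
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Integer cv and an explicit unshuffled KFold produce identical fold scores.
a = cross_val_score(LinearRegression(), X, y, scoring='r2', cv=5)
b = cross_val_score(LinearRegression(), X, y, scoring='r2',
                    cv=KFold(n_splits=5, shuffle=False))
print(np.allclose(a, b))  # True
```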

Thanks in advance for your help!

Best Answer

The difference between your two snippets is shuffling. When you pass an integer like cv=5 to cross_val_score with a regression target, it splits with KFold(n_splits=5, shuffle=False), i.e. five contiguous blocks of rows. The Kaggle World Happiness file is ordered by happiness rank, so each unshuffled fold contains only countries from one slice of the ranking, and the model is evaluated on a range of the target it never saw during training — hence the large negative r2 scores. That also explains your third observation: more folds means narrower slices of the ranking, so the scores get even worse. Your manual loop uses KFold(n_splits=5, shuffle=True, random_state=1), so just pass that same splitter as cv instead of an integer.
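A minimal sketch of the fix on a small synthetic dataset sorted by its target (the data, sizes, and seeds here are made up for illustration — substitute your own X and y):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in for the Happiness data: rows sorted by the target,
# the way a ranked dataset ships.
rng = np.random.default_rng(1)
X = rng.normal(size=(155, 3))
y = X.sum(axis=1) + rng.normal(scale=0.6, size=155)
order = np.argsort(y)
X, y = X[order], y[order]

# Integer cv: contiguous, unshuffled folds over the sorted rows.
plain = cross_val_score(LinearRegression(), X, y, scoring='r2', cv=5)

# The fix: pass the shuffled splitter from the manual loop as cv.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
fixed = cross_val_score(LinearRegression(), X, y, scoring='r2', cv=kf)

print(plain.round(3))   # some folds strongly negative
print(fixed.round(3))   # reasonable scores throughout
```

Because the rows are ordered by the target, the unshuffled folds evaluate on target ranges the training folds never covered; shuffling removes that mismatch.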
