I am working with the World Happiness Report dataset from Kaggle. When using either cross_val_score or GridSearchCV from sklearn, I get very large negative R² scores. My first thought was that the models I was using were severely over-fitting (it is a small dataset), but when I performed cross-validation using KFold to split the data, I got reasonable results.
You can view an example of what I am talking about in this Google Colab Notebook. The relevant code is also shown below.
Using cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
print(cross_val_score(model, X, y, scoring='r2', cv=5))
Output: [-5.57285067 -5.9477523 -6.23988074 -8.84930385 -2.39521998]
Using KFold
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

model = LinearRegression()
kf = KFold(n_splits=5, random_state=1, shuffle=True)
scores = []
for train_index, test_index in kf.split(X):
    X_train = X[train_index, :]
    y_train = y[train_index]
    X_test = X[test_index, :]
    y_test = y[test_index]
    model.fit(X_train, y_train)
    test_score = model.score(X_test, y_test)
    scores.append(round(test_score, 6))
print(scores)
Output: [0.829785, 0.774577, 0.762708, 0.661945, 0.727391]
Some Additional Observations
- It doesn't seem to matter what type of model I use; I still get very large negative scores when using cross_val_score.
- I created a synthetic dataset that was approximately the same size as the World Happiness dataset just to try some things out. In that case, I did not get large negative R² scores from cross_val_score. This is shown in the Google Colab notebook that I shared above.
- I notice that the magnitude of the negative results I get using cross_val_score is greatly affected by the number of folds I use: increasing the number of folds significantly increases the magnitude.
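The fold-count effect can be reproduced without sklearn at all. Below is a numpy-only sketch on synthetic data that is sorted by its target (every size, coefficient, and noise scale here is invented for illustration, not taken from the happiness data): ordinary least squares fitted on the remaining folds, scored with R² on each contiguous test fold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data, then sorted by the target to mimic a
# CSV whose rows are ordered by the score column.
n = 150
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
order = np.argsort(y)
X, y = X[order], y[order]

def fold_scores(n_splits):
    """R^2 on each contiguous (unshuffled) test fold."""
    scores = []
    for test_idx in np.array_split(np.arange(n), n_splits):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # Ordinary least squares with an intercept column.
        A_tr = np.c_[X[train_idx], np.ones(len(train_idx))]
        A_te = np.c_[X[test_idx], np.ones(len(test_idx))]
        coef, *_ = np.linalg.lstsq(A_tr, y[train_idx], rcond=None)
        resid = y[test_idx] - A_te @ coef
        ss_res = np.sum(resid ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return scores

scores_5 = fold_scores(5)
scores_10 = fold_scores(10)
print("5 folds :", [round(s, 2) for s in scores_5])
print("10 folds:", [round(s, 2) for s in scores_10])
```

Because the rows are sorted, each contiguous test fold spans only a narrow band of target values, so the fold's own mean (the R² baseline) has tiny variance and R² goes strongly negative; narrower folds (more splits) make it worse.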
Thanks in advance for your help!
Best Answer
The difference is the shuffling. When you pass cv=5 to cross_val_score for a regression task, it uses KFold(n_splits=5) with shuffle=False, whereas your manual loop uses KFold(..., shuffle=True, random_state=1). The Kaggle happiness CSV is ordered by happiness rank, so without shuffling each test fold contains a contiguous band of target values that the training folds barely cover, which drives R² strongly negative; the more folds you use, the narrower each band and the worse the scores, exactly as you observed. Pass a shuffled KFold object as the cv argument and the two approaches agree.
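A self-contained sketch of the fix, using synthetic data sorted by the target as a stand-in for the real table (the shapes, coefficients, and noise scale are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the happiness data: rows sorted by the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=150)
order = np.argsort(y)
X, y = X[order], y[order]

model = LinearRegression()

# cv=5 means KFold(n_splits=5) with shuffle=False: contiguous folds.
unshuffled = cross_val_score(model, X, y, scoring='r2', cv=5)

# Passing a shuffled KFold reproduces the manual-loop behaviour.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
shuffled = cross_val_score(model, X, y, scoring='r2', cv=kf)

print("cv=5:          ", unshuffled)
print("shuffled KFold:", shuffled)
```

Alternatively, shuffle the rows once up front (e.g. with sklearn.utils.shuffle) and cv=5 behaves the same way.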