I am working with the World Happiness Report dataset from Kaggle. When using either cross_val_score or GridSearchCV from sklearn, I get very large negative R² scores. My first thought was that the models I was using were severely over-fitting (it is a small dataset), but when I performed cross-validation using KFold to split the data, I got reasonable results.
You can view an example of what I am talking about in this Google Colab Notebook. The relevant code is also shown below.
Using cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()
print(cross_val_score(model, X, y, scoring='r2', cv=5))
Output: [-5.57285067 -5.9477523 -6.23988074 -8.84930385 -2.39521998]
Using KFold
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

model = LinearRegression()
kf = KFold(n_splits=5, random_state=1, shuffle=True)
scores = []
for train_index, test_index in kf.split(X):
    X_train = X[train_index, :]
    y_train = y[train_index]
    X_test = X[test_index, :]
    y_test = y[test_index]
    model.fit(X_train, y_train)
    test_score = model.score(X_test, y_test)
    scores.append(round(test_score, 6))
print(scores)
Output: [0.829785, 0.774577, 0.762708, 0.661945, 0.727391]
Some Additional Observations
- It doesn't seem to matter what type of model I use; I still get very large negative scores when using cross_val_score.
- I created a synthetic dataset that was approximately the same size as the World Happiness dataset just to try some things out. In that case, I did not get large negative R² scores from cross_val_score. This is shown in the Google Colab notebook that I shared above.
- I notice that the magnitude of the negative results I get using cross_val_score is greatly affected by the number of folds I use: increasing the number of folds significantly increases the magnitude.
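The fold-count effect can be reproduced without sklearn at all. Below is a numpy-only sketch on synthetic data that is sorted by its target (every size, coefficient, and noise scale here is invented for illustration, not taken from the happiness data): ordinary least squares fitted on the remaining folds, scored with R² on each contiguous test fold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data, then sorted by the target to mimic a
# CSV whose rows are ordered by the score column.
n = 150
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
order = np.argsort(y)
X, y = X[order], y[order]

def fold_scores(n_splits):
    """R^2 on each contiguous (unshuffled) test fold."""
    scores = []
    for test_idx in np.array_split(np.arange(n), n_splits):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # Ordinary least squares with an intercept column.
        A_tr = np.c_[X[train_idx], np.ones(len(train_idx))]
        A_te = np.c_[X[test_idx], np.ones(len(test_idx))]
        coef, *_ = np.linalg.lstsq(A_tr, y[train_idx], rcond=None)
        resid = y[test_idx] - A_te @ coef
        ss_res = np.sum(resid ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return scores

scores_5 = fold_scores(5)
scores_10 = fold_scores(10)
print("5 folds :", [round(s, 2) for s in scores_5])
print("10 folds:", [round(s, 2) for s in scores_10])
```

Because the rows are sorted, each contiguous test fold spans only a narrow band of target values, so the fold's own mean (the R² baseline) has tiny variance and R² goes strongly negative; narrower folds (more splits) make it worse.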
Thanks in advance for your help!
Best Answer
The difference is the shuffling. When you pass cv=5 to cross_val_score for a regression task, it uses KFold(n_splits=5) with shuffle=False, whereas your manual loop uses KFold(..., shuffle=True, random_state=1). The Kaggle happiness CSV is ordered by happiness rank, so without shuffling each test fold contains a contiguous band of target values that the training folds barely cover, which drives R² strongly negative; the more folds you use, the narrower each band and the worse the scores, exactly as you observed. Pass a shuffled KFold object as the cv argument and the two approaches agree.
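A self-contained sketch of the fix, using synthetic data sorted by the target as a stand-in for the real table (the shapes, coefficients, and noise scale are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the happiness data: rows sorted by the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=150)
order = np.argsort(y)
X, y = X[order], y[order]

model = LinearRegression()

# cv=5 means KFold(n_splits=5) with shuffle=False: contiguous folds.
unshuffled = cross_val_score(model, X, y, scoring='r2', cv=5)

# Passing a shuffled KFold reproduces the manual-loop behaviour.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
shuffled = cross_val_score(model, X, y, scoring='r2', cv=kf)

print("cv=5:          ", unshuffled)
print("shuffled KFold:", shuffled)
```

Alternatively, shuffle the rows once up front (e.g. with sklearn.utils.shuffle) and cv=5 behaves the same way.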