Solved – Cross validation and negative score

cross-validation, machine learning

I am new to machine learning. I am using both HuberRegressor and LinearRegression on my data, and I ran cross_val_score with 5 and 10 splits.
With 5 splits I get positive scores for both Huber and LinearRegression, but with 10 splits I get the values below:

HuberRegressor, scores with KFold of 10:
[-0.89745286  0.57398566  0.89670278  0.71272131  0.67122895  0.37063536  0.34396314  0.91340008  0.71485618  0.74122021]

LinearRegression, scores with KFold of 10:
[-0.25560712  0.53450138  0.88401398  0.77523712  0.66942213  0.4324412  0.30291753  0.98206453  0.76385236  0.7207619]

Can somebody explain whether HuberRegressor or LinearRegression is the better model here, and how to interpret a negative score in both models?

I am computing the scores as below:

from sklearn.model_selection import KFold, cross_val_score

cv1 = KFold(n_splits=10)
scores = cross_val_score(pipeline1, X, y, cv=cv1)

The values listed above are the results returned by cross_val_score from sklearn.

I tried adding shuffle=True in KFold and I no longer get negative values. I would still appreciate it if someone could explain the behavior in a little more depth.

Best Answer

The cross-validation you are doing is exactly how one determines the better model: average all of those scores, and the model with the higher average score wins. I've done that for you here:

Huber: 0.504

Linear: 0.581
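
If you want to reproduce those averages, they are just the means of the arrays cross_val_score returned, e.g. with numpy:

import numpy as np

# Scores reported above for KFold with n_splits=10
huber_scores = np.array([-0.89745286, 0.57398566, 0.89670278, 0.71272131, 0.67122895,
                         0.37063536, 0.34396314, 0.91340008, 0.71485618, 0.74122021])
linear_scores = np.array([-0.25560712, 0.53450138, 0.88401398, 0.77523712, 0.66942213,
                          0.4324412, 0.30291753, 0.98206453, 0.76385236, 0.7207619])
print(huber_scores.mean())   # ~0.504
print(linear_scores.mean())  # ~0.581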

Without seeing your dataset, I am not sure why you are getting a negative score. For regressors, cross_val_score uses the R² score by default, and a negative R² means the fitted model is worse than the simplest baseline: a flat line that always predicts the mean of the target. That said, as you noticed, shuffle=True produces only positive results:

cv1 = KFold(n_splits=10, shuffle=True)
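
To make the "worse than a flat line" point concrete: R² is exactly 0 when you always predict the mean of y and goes negative when the model does worse than that. A tiny sketch with made-up numbers:

from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))   # 0.0, same as predicting the mean
print(r2_score(y_true, [4.0, 3.0, 2.0, 1.0]))   # -3.0, worse than the mean, hence negative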

If your target is ordered in the dataframe, say from smallest to largest, then without shuffling each test fold is a contiguous block of rows covering a range of target values that the training folds barely represent, so the model can fit that block very badly and produce a negative score. Shuffling fixes this: each fold becomes a random sample, representative of the entire dataset rather than of some small, statistically distinct region.
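
Here is a rough, self-contained sketch of that effect on synthetic data (the numbers are illustrative, not from your dataset): a linear model scored on rows sorted by the target tends to get a negative average without shuffling and a clearly positive one with it.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data deliberately sorted so the target runs from smallest to largest
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = X.ravel() ** 2 + rng.normal(scale=2.0, size=200)

model = LinearRegression()
plain = cross_val_score(model, X, y, cv=KFold(n_splits=10))
shuffled = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(plain.mean())     # typically strongly negative: each fold is a narrow, contiguous band of the target
print(shuffled.mean())  # positive: every fold is a representative sample of the whole dataset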

Check the averages from the shuffled KFold validation and see how they compare to the unshuffled ones. I'd recommend using the shuffled scores to decide which model to use.