Solved – How to evaluate whether a model is overfitting or underfitting when using cross_val_score and GridSearchCV

cross-validation, machine learning

This is something that has been written about extensively, but I'm just confused about a couple of particular things which I haven't found a clear explanation of.

When cross-validation is not used, the data can be split into train and test sets and the model trained on the train set. The model can then be evaluated on both sets; the goal is to get similar performance on each, which suggests the model is neither over- nor under-fit.
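For concreteness, this is the kind of workflow I mean (just a sketch; scikit-learn, LogisticRegression and synthetic data are stand-ins for the real problem):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# similar train and test scores suggest neither over- nor under-fitting;
# train much higher than test suggests overfitting; both low suggests underfitting
print(clf.score(X_train, y_train), clf.score(X_test, y_test))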

As far as I understand, when cross-validation is used it removes the need for a separate train/test split, since CV effectively performs that split a number of times (once per fold). However, averaging the scores you get from cross-validation leaves you with just a single number. Should this be interpreted as the train score or the test score from the previous case, or neither? How can we tell whether the model is overfit or underfit?
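To make the question concrete, continuing the sketch above (same X, y and classifier), this is the single number I mean:

from sklearn.model_selection import cross_val_score

# one score per fold; the single number I refer to is the mean of these
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)
print(scores.mean())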

I am wondering how this fits in with GridSearchCV, since I have read that you are supposed to split your data into a train and validation set to confirm that your performance metric remains approximately the same. Is this step necessary, or can we just assume the model is neither over- nor under-fit because we let GridSearchCV choose the best hyperparameters?

Furthermore, I have read something confusing in "Introduction to Machine Learning with Python", which says that the data should be split into three parts: train, validation and test. The model is trained on the training set and evaluated on the validation set in order to choose the best hyperparameters; then, with the best hyperparameters, it is retrained on train+val and evaluated on test.
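If I understand the book correctly, the procedure is roughly this (a sketch only; the classifier, the single hyperparameter C and its grid are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# split off a final test set, then split the rest into train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

# pick the hyperparameter with the best validation score
best_score, best_C = -1, None
for C in [0.01, 0.1, 1, 10, 100]:
    score = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_score, best_C = score, C

# retrain on train+val with the best hyperparameter, evaluate once on test
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
print(final_model.score(X_test, y_test))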

Best Answer

You need to check the difference between the training and test accuracy for each fold. If your model gives you high training accuracy but low test accuracy, your model is overfitting. If your model does not give you good training accuracy in the first place, your model is underfitting.

GridSearchCV tries to find the best hyperparameters for your model. The data ends up playing three roles: GridSearchCV repeatedly splits the data you pass it into training and validation folds, fits on the training folds, and tunes the hyperparameters based on the validation-fold scores. A separate test set, which you hold out before the search, is then used to measure the final model's accuracy.
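A minimal sketch of that workflow, assuming scikit-learn, a LogisticRegression and an illustrative parameter grid: you hold out the test set yourself, GridSearchCV handles the train/validation splitting internally, and the refit best model is scored once on the test set.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# GridSearchCV does the train/validation splitting internally (5-fold CV here)
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)            # mean validation score of the best parameters
print(grid.score(X_test, y_test))  # final accuracy on the held-out test set

And here is how you can check the train/test accuracy gap fold by fold: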

import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, random_state=42, shuffle=True)


# these are your data points: features and targets
# (NumPy arrays, so they can be indexed with the fold indices below)
X = ....
y = ....

accuracies = []

for train_index, test_index in kf.split(X):

    data_train   = X[train_index]
    target_train = y[train_index]

    data_test    = X[test_index]
    target_test  = y[test_index]

    # if needed, do preprocessing here

    # fit a fresh model on this fold's training data
    clf = LogisticRegression()
    clf.fit(data_train, target_train)

    # accuracy on the held-out fold (the "test" score for this fold)
    test_preds = clf.predict(data_test)
    test_accuracy = accuracy_score(target_test, test_preds)

    # accuracy on the data the model was trained on
    train_preds = clf.predict(data_train)
    train_accuracy = accuracy_score(target_train, train_preds)

    # a large gap suggests overfitting; low training accuracy suggests underfitting
    print(train_accuracy, test_accuracy, (train_accuracy - test_accuracy))

    accuracies.append(test_accuracy)

# this is the average accuracy over all folds
average_accuracy = np.mean(accuracies)
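
If you do not want to write the loop yourself, scikit-learn's cross_validate with return_train_score=True gives you the same per-fold train/test comparison in a few lines (a sketch reusing the same X and y as above):

from sklearn.model_selection import cross_validate

# per-fold train and test scores for the same data and classifier as above
cv_results = cross_validate(LogisticRegression(), X, y, cv=5, return_train_score=True)

# a large gap between these suggests overfitting; low train scores suggest underfitting
print(cv_results["train_score"])
print(cv_results["test_score"])
print(cv_results["train_score"].mean() - cv_results["test_score"].mean())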