cross_val_score is a helper function that plugs your X and y inputs into an estimator (that you specify), trains the model, and scores the results. I suspect that what's happening here is that cross_val_score isn't calculating precision/recall directly on the X and y you've provided; rather, it's training the logistic regression on folds of X and comparing that model's predictions to your y. Since your X only has a single feature, the resulting models probably aren't going to be very effective.
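For comparison, here is roughly the kind of thing cross_val_score is doing with your inputs (a minimal sketch; the logistic regression estimator and the recall scoring are my assumptions about your setup, not taken from your code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Single binary feature and labels, shaped like your example
X = np.array([[1], [1], [1], [1], [1], [1], [1], [1],
              [0], [0], [0], [0], [0], [0], [0], [0]])
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# cross_val_score fits a fresh model on each training fold and scores that
# model's predictions on the held-out fold -- it never scores X against y directly
scores = cross_val_score(LogisticRegression(), X, y, cv=4, scoring='recall')
print(scores)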
To get the results you're expecting given those inputs, you'll want to use something like the classification_report function, which simply performs the precision/recall calculations on whatever labels you provide rather than training a model first. Here's what that looks like:
import numpy as np
from sklearn.metrics import classification_report

# Example 1
print('EXAMPLE 1 RESULTS:')
X1 = np.array([[1], [1], [1], [1], [1], [1], [1], [1], [0], [0], [0], [0], [0], [0], [0], [0]])
# The line below was cut off in your example, but I think this is what it was supposed to be
y1 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# classification_report treats the first argument as the true labels and the
# second as the predicted labels -- no model is trained here
print(classification_report(X1, y1))
print("")

# Example 2
print('EXAMPLE 2 RESULTS:')
X2 = np.array([[1], [1], [1], [1], [1], [1], [1], [1], [0], [0], [0], [0], [0], [0], [0], [0]])
y2 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(classification_report(X2, y2))
This gives you the recall results you'd expect:
EXAMPLE 1 RESULTS:
             precision    recall  f1-score   support

          0       0.80      1.00      0.89         8
          1       1.00      0.75      0.86         8

avg / total       0.90      0.88      0.87        16

EXAMPLE 2 RESULTS:
             precision    recall  f1-score   support

          0       0.67      1.00      0.80         8
          1       1.00      0.50      0.67         8

avg / total       0.83      0.75      0.73        16
Most probably you are just seeing the uncertainty of your performance estimation.
The cross-validation estimates are subject to (at least):
- some pessimistic bias due to not training on the full data set (I'd expect negligible differences between 5-fold and 10-fold CV)
- possibly optimistic bias if the splitting leaves dependence between training and test sets (can be large)
- variance due to instability of the surrogate models (can be measured from repeated/iterated CV results - typically trees suffer from this)
- variance due to the limited number of tested cases; in your case:
  - the 95 % confidence interval for 80 % observed accuracy with a test set of size 262 is 75 - 85 % (see the sketch after this list)
- (variance due to lack of representativeness of the data at hand: this does not play a role here, as you are interested only in the one existing Titanic data set)
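To see where that 75 - 85 % figure comes from, you can compute the binomial confidence interval yourself. Here's a minimal sketch using statsmodels' proportion_confint (the Wilson method is my choice; the 80 % accuracy and the sample sizes 262 and 1047 are the numbers discussed in this answer):

from statsmodels.stats.proportion import proportion_confint

# 95 % confidence interval for an observed accuracy of ~80 %,
# accounting only for the finite number of tested cases (binomial model)
for n in (262, 1047):
    k = round(0.80 * n)  # number of correctly classified cases
    lo, hi = proportion_confint(k, n, alpha=0.05, method='wilson')
    print(f"n = {n:4d}: {lo:.2f} - {hi:.2f}")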
Now the optimizer has 80 % of the data set, i.e. 1047 cases. If it observes 80 % accuracy, this estimate has about a 77 - 82 % range for a 95 % confidence interval based on the finite number of tested cases alone. In other words, there may be a number of models in the optimization that the optimizer cannot really distinguish. It will pick the parameter set that appears to be the best, but grid search does not guard against the variance sources discussed above. This causes further instability, i.e. variance, in the model: the hyperparameter choice itself becomes unstable (see also Cawley & Talbot's paper).
You can check this by running GridSearchCV several times with different CV splits and checking the distribution of the observed test set accuracies (and also of the returned hyperparameters).
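Here's a minimal sketch of such a check; the estimator, parameter grid, and generated data are placeholders you would replace with your own pipeline:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Placeholder data -- substitute your own X and y here
X, y = make_classification(n_samples=1047, n_features=10, random_state=0)
param_grid = {'C': [0.01, 0.1, 1, 10]}

# Repeat the grid search with different CV splits and look at the spread
# of the best scores and of the chosen hyperparameters
for seed in range(10):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv)
    grid.fit(X, y)
    print(f"seed {seed}: best_score_ = {grid.best_score_:.3f}, "
          f"best_params_ = {grid.best_params_}")

If the best_score_ values (and the chosen parameters) jump around between runs, you are looking at exactly the variance described above.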
In addition, if the CVs were run with new splits into training and test sets, the variance due to testing accuracy with only 262 cases can already explain the observed difference. Or, statistically speaking, you cannot reject the null hypothesis that both models have equal performance.
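For illustration only (the counts below are made up, not taken from your results), a simple two-proportion z-test shows how little a gap of a few percentage points means with 262 test cases. A paired test such as McNemar's would be more appropriate when both models are evaluated on the same cases, but this is enough to show the scale of the uncertainty:

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: model A gets 210/262 correct (~80 %),
# model B gets 220/262 correct (~84 %)
stat, p_value = proportions_ztest(count=[210, 220], nobs=[262, 262])
print(f"z = {stat:.2f}, p = {p_value:.3f}")  # p is well above 0.05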
Best Answer
grid.best_score_ is the mean cross-validation score obtained on the train dataset, while metrics.precision_score(y_test, pred) is calculated on predictions for the held-out test dataset.
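In code terms, the two numbers come from different data. A minimal sketch (the data and parameter grid are placeholders; the variable names just follow the usual train/test split pattern):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data -- replace with your own
X, y = make_classification(n_samples=1309, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Mean cross-validation score of the best parameter set,
# computed entirely on (splits of) the training data
print(grid.best_score_)

# Precision of the refit best model on the held-out test data
pred = grid.predict(X_test)
print(precision_score(y_test, pred))

The first number never sees X_test; the second never sees the cross-validation folds, so some disagreement between them is expected.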