cross_val_score is a helper function that plugs your X and y inputs into an estimator (that you specify), trains the model, and scores the results. I suspect that what's happening here is that cross_val_score isn't calculating precision/recall directly on the X and y you've provided; rather, it's training the logistic regression on folds of X and comparing that model's predictions to your y. Since your X only has a single feature, the resulting models probably aren't going to be very effective.
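For comparison, here is roughly the kind of thing cross_val_score is doing with your inputs (a minimal sketch; the logistic regression estimator and the recall scoring are my assumptions about your setup, not taken from your code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Single binary feature and labels, shaped like your example
X = np.array([[1], [1], [1], [1], [1], [1], [1], [1],
              [0], [0], [0], [0], [0], [0], [0], [0]])
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# cross_val_score fits a fresh model on each training fold and scores that
# model's predictions on the held-out fold -- it never scores X against y directly
scores = cross_val_score(LogisticRegression(), X, y, cv=4, scoring='recall')
print(scores)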
To get the results you're expecting given those inputs, you'll want to use something like the classification_report function, which simply performs the precision/recall calculations on whatever labels you provide rather than training a model first. Here's what that looks like:
import numpy as np
from sklearn.metrics import classification_report

# Example 1
print('EXAMPLE 1 RESULTS:')
X1 = np.array([[1], [1], [1], [1], [1], [1], [1], [1], [0], [0], [0], [0], [0], [0], [0], [0]])
# The line below was cut off in your example, but I think this is what it was supposed to be
y1 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# classification_report treats the first argument as the true labels and the
# second as the predicted labels -- no model is trained here
print(classification_report(X1, y1))
print("")

# Example 2
print('EXAMPLE 2 RESULTS:')
X2 = np.array([[1], [1], [1], [1], [1], [1], [1], [1], [0], [0], [0], [0], [0], [0], [0], [0]])
y2 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(classification_report(X2, y2))
This gives you the recall results you'd expect:
EXAMPLE 1 RESULTS:
             precision    recall  f1-score   support

          0       0.80      1.00      0.89         8
          1       1.00      0.75      0.86         8

avg / total       0.90      0.88      0.87        16

EXAMPLE 2 RESULTS:
             precision    recall  f1-score   support

          0       0.67      1.00      0.80         8
          1       1.00      0.50      0.67         8

avg / total       0.83      0.75      0.73        16
Most probably you are just seeing the uncertainty of your performance estimation.
The cross-validation estimates are subject to (at least):
- some pessimistic bias due to not training on the full data set (I'd expect negligible differences between 5-fold and 10-fold CV)
- possibly optimistic bias if the splitting leaves dependence between training and test sets (can be large)
- variance due to instability of the surrogate models (can be measured from repeated/iterated CV results - typically trees suffer from this)
- variance due to the limited number of tested cases; in your case:
  - the 95 % confidence interval for 80 % observed accuracy with a test set of size 262 is 75 - 85 % (see the sketch after this list)
- (variance due to lack of representativeness of the data at hand: this does not play a role here, as you are interested only in the one existing Titanic data set)
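To see where that 75 - 85 % figure comes from, you can compute the binomial confidence interval yourself. Here's a minimal sketch using statsmodels' proportion_confint (the Wilson method is my choice; the 80 % accuracy and the sample sizes 262 and 1047 are the numbers discussed in this answer):

from statsmodels.stats.proportion import proportion_confint

# 95 % confidence interval for an observed accuracy of ~80 %,
# accounting only for the finite number of tested cases (binomial model)
for n in (262, 1047):
    k = round(0.80 * n)  # number of correctly classified cases
    lo, hi = proportion_confint(k, n, alpha=0.05, method='wilson')
    print(f"n = {n:4d}: {lo:.2f} - {hi:.2f}")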
Now the optimizer has 80 % of the data set, i.e. 1047 cases. If it observes 80 % accuracy, this estimate has about a 77 - 82 % range for a 95 % confidence interval based on the finite number of tested cases alone. In other words, there may be a number of models in the optimization that the optimizer cannot really distinguish. It will pick the parameter set that appears to be the best, but grid search does not guard against the variance sources discussed above. This causes further instability, i.e. variance, in the model: the hyperparameter choice itself becomes unstable (see also Cawley & Talbot's paper).
You can check this by running GridSearchCV several times with different CV splits and checking the distribution of the observed test set accuracies (and also of the returned hyperparameters).
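Here's a minimal sketch of such a check; the estimator, parameter grid, and generated data are placeholders you would replace with your own pipeline:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Placeholder data -- substitute your own X and y here
X, y = make_classification(n_samples=1047, n_features=10, random_state=0)
param_grid = {'C': [0.01, 0.1, 1, 10]}

# Repeat the grid search with different CV splits and look at the spread
# of the best scores and of the chosen hyperparameters
for seed in range(10):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv)
    grid.fit(X, y)
    print(f"seed {seed}: best_score_ = {grid.best_score_:.3f}, "
          f"best_params_ = {grid.best_params_}")

If the best_score_ values (and the chosen parameters) jump around between runs, you are looking at exactly the variance described above.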
In addition, if the CVs were run with new splits into training and test sets, the variance due to testing accuracy with only 262 cases can already explain the observed difference. Or, statistically speaking, you cannot reject the null hypothesis that both models have equal performance.
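For illustration only (the counts below are made up, not taken from your results), a simple two-proportion z-test shows how little a gap of a few percentage points means with 262 test cases. A paired test such as McNemar's would be more appropriate when both models are evaluated on the same cases, but this is enough to show the scale of the uncertainty:

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: model A gets 210/262 correct (~80 %),
# model B gets 220/262 correct (~84 %)
stat, p_value = proportions_ztest(count=[210, 220], nobs=[262, 262])
print(f"z = {stat:.2f}, p = {p_value:.3f}")  # p is well above 0.05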
Best Answer
grid.best_score_ is the mean cross-validation score obtained on the train dataset, while metrics.precision_score(y_test, pred) is calculated on predictions for the held-out test dataset.
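In code terms, the two numbers come from different data. A minimal sketch (the data and parameter grid are placeholders; the variable names just follow the usual train/test split pattern):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data -- replace with your own
X, y = make_classification(n_samples=1309, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# Mean cross-validation score of the best parameter set,
# computed entirely on (splits of) the training data
print(grid.best_score_)

# Precision of the refit best model on the held-out test data
pred = grid.predict(X_test)
print(precision_score(y_test, pred))

The first number never sees X_test; the second never sees the cross-validation folds, so some disagreement between them is expected.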