Python – Can You Average Precision and Recall After K-Fold Cross Validation?

cross-validation, precision-recall, python, scikit-learn

I have created a 5-fold cross-validation model and used the cross_val_score function to calculate the precision and recall of the cross-validated model as follows:

from sklearn.model_selection import cross_val_score

def print_accuracy_report(classifier, X, y, num_validations=5):
    precision = cross_val_score(classifier,
            X, y, scoring='precision', cv=num_validations)
    print("Precision: " + str(round(100 * precision.mean(), 2)) + "%")

    recall = cross_val_score(classifier,
            X, y, scoring='recall', cv=num_validations)
    print("Recall: " + str(round(100 * recall.mean(), 2)) + "%")

I wonder whether it is valid to report these lines:

    print "Precision: " + str(round(100*precision.mean(), 2)) + "%"
    print "Recall: " + str(round(100*recall.mean(), 2)) + "%" 

That is, do precision.mean() and recall.mean() represent the precision and recall of the whole model?

Just for comparison's sake, in the scikit-learn documentation I've seen the model's accuracy calculated as:

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)

array([0.96…, 1. …, 0.96…, 0.96…, 1. ])

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)

Best Answer

First of all, when you do 5-fold cross validation, you don't have one model, you have five. So it's not really correct to talk about the precision/recall of the "whole model" since there isn't just one. Rather, you're getting an estimate of the precision/recall from your model-building process.

That said, each fold produces a model with its own precision and recall, and you can average them to get a mean performance metric over all your folds. One thing to note, though, is that since recall is the proportion of true positives out of all actual positives, you'll have to weight each fold by its number of positives.

Imagine a case where you have 4 folds that each contain only one positive, which is correctly identified, giving you 100% recall on those folds. The fifth fold has 96 positives, 46 of which are correctly identified, for 48% recall. A straight mean would give you a recall of about 90%, but if you account for the greater number of positives in the fifth fold, your overall recall is only 50% (50 of 100 positives identified).

If your folds are well stratified, this problem largely takes care of itself for recall, but for precision, which depends on the number of predicted positives in each fold, I don't see any way to stratify before doing prediction (you'd have to know the prediction output before defining the folds and training the model). I would implement the weighted-average method, as it works for any metric you choose to compute and in cases where perfectly equal stratification isn't possible (when N isn't evenly divisible by K). A sketch of that weighting is below.
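Here is a minimal sketch of the weighted average, assuming a generic scikit-learn binary classifier clf and NumPy arrays X, y; the function and variable names are illustrative, not from the original question:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score

def weighted_cv_recall(clf, X, y, n_splits=5):
    fold_recalls = []
    fold_positives = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        clf.fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        fold_recalls.append(recall_score(y[test_idx], y_pred))
        fold_positives.append((y[test_idx] == 1).sum())  # weight = actual positives in this fold
    # Weight each fold's recall by its number of positives rather than taking a straight mean
    return np.average(fold_recalls, weights=fold_positives)

Plugging the toy numbers above into the same weighting reproduces the gap between the two averages:

fold_recalls = [1.0, 1.0, 1.0, 1.0, 46 / 96]
fold_positives = [1, 1, 1, 1, 96]
print(np.mean(fold_recalls))                             # ~0.90, the straight mean
print(np.average(fold_recalls, weights=fold_positives))  # 0.50, the weighted mean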

Another approach suggested in the comments, which would be equivalent to weighted averages of summary metrics, is to sum the prediction confusion matrices from each fold and compute summary statistics from the combined matrix. By summing the TP, TN, FP, and FN from all folds and then computing precision/recall, you are implicitly accounting for any differences in prevalence of positive cases or positive predictions across folds.