Solved – Averaging precision and recall when using cross validation

classification, cross-validation, precision-recall

I have performed classification with multiple classifiers on a two-class labelled dataset, using 5-fold cross validation. For each fold I calculated tp, tn, fp, and fn, and from those the accuracy, precision, recall, and F-score for that fold's test set. To combine the results I took the average of the accuracies, but can I average precision, recall, and the F-score the same way? Or would that be mathematically wrong?
P.S. The datasets used in each fold are well balanced in terms of the number of instances per class.

Thanks.

Best Answer

The $F$-score, assuming you're using the usual definition, is already a combination of precision and recall: specifically, it is their harmonic mean,
$$F_1 = 2\cdot\frac{\textrm{precision} \cdot \textrm{recall}}{\textrm{precision} + \textrm{recall}}.$$
It's meant to capture the 'effectiveness' of a system for a user who places equal weight on precision and recall. There's an extension, called the $F_\beta$ score, which attaches $\beta$ times as much importance to recall as to precision:
$$F_\beta = (1+\beta^2)\, \frac{\textrm{precision} \cdot \textrm{recall}}{(\beta^2 \cdot \textrm{precision}) + \textrm{recall}}.$$

On the other hand, if you're asking whether you can average the five $F$ scores (one from each fold), then the answer is yes. In fact, that's the typical way to report a system's performance!
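For concreteness, here is a minimal sketch of that per-fold-then-average procedure, assuming scikit-learn is available. The synthetic dataset and the logistic-regression classifier are illustrative stand-ins, not part of the original question:

```python
# Compute precision, recall, and F1 on each of 5 folds, then average.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score

X, y = make_classification(n_samples=500, random_state=0)  # balanced 2-class data
clf = LogisticRegression(max_iter=1000)

precisions, recalls, f1s = [], [], []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    precisions.append(precision_score(y[test_idx], pred))
    recalls.append(recall_score(y[test_idx], pred))
    f1s.append(f1_score(y[test_idx], pred))

# One score per fold, then the mean (and spread) across folds.
print(f"precision: {np.mean(precisions):.3f} +/- {np.std(precisions):.3f}")
print(f"recall:    {np.mean(recalls):.3f} +/- {np.std(recalls):.3f}")
print(f"F1:        {np.mean(f1s):.3f} +/- {np.std(f1s):.3f}")
```

Reporting the standard deviation alongside the mean is common practice, since it gives some sense of how much the score varies across folds.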

Just be aware that there are some issues with using these values to make inferences about the classifiers' generalization error. Because the training sets of the folds overlap, the per-fold scores are not independent; as a result, a $t$-test between the $F$ scores for one classifier and the $F$ scores for another classifier is going to be too optimistic.
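As an illustration of the pitfall (assuming SciPy; the per-fold scores below are hypothetical), this is what such a naive paired $t$-test would look like:

```python
# Naive paired t-test between two classifiers' per-fold F1 scores.
# Because the folds share training data, the scores are not independent,
# so the resulting p-value tends to be overly optimistic.
from scipy.stats import ttest_rel

f1_clf_a = [0.81, 0.79, 0.83, 0.80, 0.82]  # illustrative per-fold scores
f1_clf_b = [0.78, 0.77, 0.80, 0.78, 0.79]

t_stat, p_value = ttest_rel(f1_clf_a, f1_clf_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # treat this p-value with caution
```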