Measuring Precision/Recall on a biased sample

classification, model-evaluation, precision-recall, sampling, unbalanced-classes

I am working with ML models that predict, for example, whether an email violates some corporate policy. In this case, the "positives" are emails that violate the policy, and the number of positives is very low: the vast majority of emails do not violate the policy, and those are the "negatives". The model outputs scores between 0 and 1, where a higher score means a higher likelihood of a policy violation (it's a binary classification problem with very imbalanced classes).

In order to measure the precision and recall of the model, defined as usual, we will send a sample of emails scored by the model to human reviewers (let's assume that human review is the ground truth). The human reviewers will not see the model scores.

There are suggestions within the team that we should use the model scores as sampling weights: the higher an email's model score, the higher its chance of being included in the sample sent to the human reviewers. The reasoning is that this biased sampling procedure would be "more representative of the general population", in particular when it comes to evaluating the recall.
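For concreteness, here is a minimal sketch of the two sampling schemes under discussion (NumPy, with a synthetic score distribution; all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic model scores, skewed toward 0 as in an imbalanced problem (illustrative only).
scores = rng.beta(0.5, 10.0, size=100_000)

# Scheme 1: uniform sample sent to human reviewers.
uniform_idx = rng.choice(len(scores), size=500, replace=False)

# Scheme 2: score-weighted sample, inclusion probability proportional to the model score.
weighted_idx = rng.choice(len(scores), size=500, replace=False, p=scores / scores.sum())

print("mean score, uniform sample: ", scores[uniform_idx].mean())
print("mean score, weighted sample:", scores[weighted_idx].mean())
```

The weighted sample concentrates on high-score emails, which is exactly the property being debated below.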

Is such a biased sampling procedure actually going to result in a better estimate of the recall? Does this biased sampling approach have any merit at all compared to a uniform sample?

Personally, I think that the model scores are not representative of the complete population, but only of what the model thinks, so I find it very questionable to use the model scores to bias the sample. If anything, that procedure would pull more model-predicted positives into the sample, so we would get more true positives (TP) and false positives (FP), which are the ingredients in the definition of precision, TP/(TP+FP). Recall, TP/(TP+FN), needs TP and FN (false negatives), so estimating it would benefit from seeing more FNs, rather than more FPs.

Note: this question is really about the procedure to evaluate the precision/recall of the model, and whether that procedure should include any bias or simply draw samples uniformly. I can't seem to find a real argument to defend the biased approach.

Best Answer

The ideal approach would have been to set aside an already-labeled test set drawn randomly from the real population. I'm assuming it's too late for that, so I'll hazard an answer, but don't take it as authoritative.

Given the imbalance, sending a uniformly random sample of new emails to humans to label would require a lot of effort to get a reasonable number of positives in the sample, and so may not be worth the cost. So trying to increase the number of positives in the sample seems a worthwhile goal. To report accurate metrics, you should then weight the samples to reverse the effect of the sampling. (For example, if you split your fresh emails into "high risk" and "low risk" groups with proportions 5% and 95% respectively, but sample so that the human labelers get a balanced set, then you have over-represented the high-risk emails relative to the low-risk ones by a factor of (95/50)/(5/50) = 19, and so you should multiply the counts coming from the low-risk group, the TNs and FNs, by 19 before computing metrics.)
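To make that arithmetic concrete, here is a minimal sketch of the correction. The confusion counts are invented, and I'm assuming the high-risk group is exactly the set of model-flagged emails (so it contributes the TPs and FPs) and the low-risk group is the rest (the TNs and FNs):

```python
# Hypothetical human-labeled counts per stratum (all numbers made up for illustration).
TP, FP = 60, 40    # from the high-risk stratum (model-flagged emails)
FN, TN = 15, 485   # from the low-risk stratum (everything else)

# Low-risk emails were under-sampled: 95% of traffic but only 50% of the reviewed set,
# while high-risk emails were 5% of traffic and 50% of the reviewed set.
w = (95 / 50) / (5 / 50)   # = 19, relative inverse-probability weight for the low-risk stratum

precision = TP / (TP + FP)       # both counts come from the same stratum, so the weight cancels
recall = TP / (TP + w * FN)      # FN (and TN) must be scaled up before computing metrics

print(f"precision = {precision:.3f}")   # 0.600
print(f"recall    = {recall:.3f}")      # 60 / (60 + 19*15) ≈ 0.174
```

Note how the naive recall on the raw sample, 60/(60+15) = 0.8, would be wildly optimistic compared to the reweighted estimate.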

Using the model you are trying to evaluate as the weighting scheme is a further problem. In particular, perhaps your model is blind to a certain kind of policy violation; then those emails won't be up-sampled, your human reviewers may never label (many of) them, and you will get an optimistically low count of FNs.

But I don't see an easy way around it. Maybe use a simpler but better-than-random model for the selection? You should still mention that caveat when reporting the results.

...more true positives (TP) and false positives (FP) [...] The recall, TP/(TP+FN), needs TP and FN (false negatives), so it would benefit from seeing more FNs, rather than more FPs.

An optimistically inflated TP count will still push the estimated recall toward 1. But here again, by weighting the counts you can partially correct for this bias.
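In the same spirit, here is a hedged simulation sketch of that weighting in the general case, where each email's inclusion probability is proportional to its model score and the counts are reweighted by inverse inclusion probabilities (a Horvitz-Thompson-style estimator). Everything here is assumed for illustration: the synthetic score distribution, the 0.5 decision threshold, and the review budget of roughly 5,000 emails.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic population: ~1% positives, scores loosely correlated with the labels.
n = 200_000
y_true = rng.random(n) < 0.01
scores = np.clip(0.45 * y_true + rng.normal(0.08, 0.15, n), 1e-4, 1.0)
y_pred = scores > 0.5  # assumed decision threshold

# Score-proportional sampling of a review set of ~5,000 emails (Poisson sampling).
incl_p = np.clip(scores / scores.sum() * 5_000, 0.0, 1.0)
sampled = rng.random(n) < incl_p

# Inverse-probability weights undo the sampling bias when counting TP/FP/FN.
w = 1.0 / incl_p[sampled]
tp = np.sum(w * ( y_pred[sampled] &  y_true[sampled]))
fp = np.sum(w * ( y_pred[sampled] & ~y_true[sampled]))
fn = np.sum(w * (~y_pred[sampled] &  y_true[sampled]))

print("weighted precision:", tp / (tp + fp))
print("weighted recall:   ", tp / (tp + fn))
print("true recall:       ", (y_pred & y_true).sum() / y_true.sum())
```

Note that low-score positives carry enormous weights here, so the recall estimate has high variance; and any violation type the model scores near zero will almost never be reviewed at all, which is exactly the blind-spot caveat above.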