Object Detection – Understanding Average Precision Metrics

average-precision, machine-learning, model-evaluation, object-detection, precision-recall

I'm quite confused as to how I can calculate the AP or mAP values as there seem to be quite a few different methods. I specifically want to get the AP/mAP values for object detection.

All I know for sure is:

Recall = TP/(TP + FN),
Precision = TP/(TP + FP)

For example, suppose I only have 1 class to evaluate and, say, 500 test images. Each test image may have a different number of predictions (bounding box proposals), but each image has only one ground-truth bounding box.

Image 1: [class, probability, x1, y1, x2, y2], [class, probability, x3, y3, x4, y4], [class, probability, x5, y5, x6, y6], [class, probability, x7, y7, x8, y8], …

Image 2: [class, probability, x1, y1, x2, y2], [class, probability, x3, y3, x4, y4], …

.
.
. (and so on)

*just an example, I made this up

I know that to get TP, we'd have to find the IOUs of each prediction and count the ones above a selected threshold such as 0.5 (if we have multiple predictions with IOUs above the threshold, do we only count once and treat the others as FP?).

This is where it puzzles me:

  1. Would the TP+FP = # of predictions made for each image?

  2. Since all test images have no negatives, TP+FN = 500?

  3. Is it calculated per image, or per class?

  4. Could someone give me a step-by-step guide to get the AP/mAP based on my example? The most ambiguous part for me is whether we do it per image or per class (i.e. over all 500 images at once).

Most guides/papers I found are targeted towards information retrieval. I would appreciate some help with this.

*Note: I am testing it on some custom dataset. I know PASCAL VOC has some code to do it, but I want to write the code myself, customised to my own data.

Best Answer

I think the accepted answer points in the wrong direction for computing mAP, because even for a single class the AP is an average over recall levels, not a single precision value. In my answer I will still include the interpretation of IoU, so beginners will have no trouble understanding it.

For a given object-detection task, participants submit a list of bounding boxes, each with a confidence (the predicted probability) for its class. To be considered a valid detection, the overlap ratio $a_o$ between the predicted bounding box $b_p$ and the ground-truth bounding box $b_t$, i.e. the area of their intersection divided by the area of their union, has to exceed 0.5. The corresponding formula is: $$ a_o = \frac{Area(b_p \cap b_t)}{Area(b_p \cup b_t)} $$
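As a minimal sketch (the function name is mine, and I assume the (x1, y1, x2, y2) corner format from the question), the IoU of two boxes can be computed like this:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Intersection is empty if the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```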

After we have sifted out the list of $M$ valid predicted bounding boxes, we evaluate each class independently as a two-class problem. So, for a typical evaluation of the 'human' class, we might first list these $M$ bounding boxes as follows:

Index of Object, Confidence, Ground truth (1 = the detection matches a ground-truth box, 0 = it does not)
Bounding Box 1, 0.8, 1
Bounding Box 1, 0.7, 1
Bounding Box 2, 0.1, 0
Bounding Box 3, 0.9, 1

Then you rank these boxes by confidence from high to low. Afterwards, you compute the precision-recall curve as usual and take the interpolated precision at the 11 recall levels [0, 0.1, ..., 1]. (The detailed calculation method is described here.) It is worth mentioning that for multiple detections of a single ground-truth bounding box, e.g. bounding box 1 in my example, we count at most one of them as correct and treat all the others as false positives. You then iterate through the 20 classes (in PASCAL VOC), compute the average of their APs, and that gives you your mAP.
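Here is a sketch of that per-class computation, assuming the detections have already been matched against ground truth (flag 1 for at most one detection per ground-truth box at IoU ≥ 0.5, flag 0 otherwise) and that n_gt is the total number of ground-truth boxes of the class over all test images; all names are illustrative, not from any particular library:

```python
def average_precision_11pt(detections, n_gt):
    """11-point interpolated AP for one class.

    detections: list of (confidence, is_true_positive) pairs pooled over
                ALL test images (not per image).
    n_gt:       total number of ground-truth boxes of this class.
    """
    # Rank detections by confidence, highest first.
    detections = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    tp, fp = 0, 0
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))   # precision after this detection
        recalls.append(tp / n_gt)           # recall after this detection

    # At each recall level r in {0, 0.1, ..., 1.0}, take the maximum
    # precision achieved at any recall >= r, then average the 11 values.
    ap = 0.0
    for r in (i / 10 for i in range(11)):
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        ap += (max(candidates) if candidates else 0.0) / 11
    return ap


# mAP is then the mean of the per-class APs, e.g.:
# mAP = sum(average_precision_11pt(dets[c], n_gt[c]) for c in classes) / len(classes)
```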

Also, nowadays this method is often tweaked a little to compute mAP: instead of the 11 fixed recall break points, we use the true number $K$ of ground-truth objects of the class and compute the interpolated precision at the recall points [0, 1/K, 2/K, ..., 1].
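Under the same assumptions as the sketch above, the only change in this variant is the set of recall points at which the precision is interpolated:

```python
def average_precision_kpt(detections, n_gt):
    """Like the 11-point version, but interpolated at recalls 0, 1/K, ..., 1,
    where K = n_gt is the number of ground-truth boxes of the class."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    tp, fp = 0, 0
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt)

    recall_points = [k / n_gt for k in range(n_gt + 1)]  # [0, 1/K, ..., 1]
    ap = 0.0
    for r in recall_points:
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        ap += (max(candidates) if candidates else 0.0) / len(recall_points)
    return ap
```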
