Solved – YOLOv3 loss function

Tags: loss-functions, neural-networks, yolo

Follow-up to stats.stackexchange.com/questions/373266/yolo-v3-loss-function:

In trying to finalize the development of my training labels and loss function I'm confused by the part in bold in the quote below (from the YOLOv3 paper). I'm considering that "bounding box prior" is synonymous with "anchor".

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold, we ignore the prediction, following [17]. We use the threshold of .5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object, it incurs no loss for coordinate or class predictions, only objectness.

Question 1

Is the bold portion above saying that if there is more than one anchor with an IOU > 0.5, then the ground truth object is not considered at all? It makes sense that the anchor with the highest/best IOU be responsible for a particular ground truth object, but the threshold doesn't make sense to me. It seems to be implying a threshold being considered for the anchor with the second-highest IOU ("not the best but does overlap a ground truth object by more than some threshold").

Question 2

Does YOLOv3 still make use of $\lambda_{coord}$ and $\lambda_{noobj}$? Assuming so and putting it all together, does the loss function below look correct? It assumes a prediction vector of $t_x$, $t_y$, $t_w$, $t_h$, $t_o$, $s_1$, $…$, $s_C$ and a corresponding ground truth label of $\hat{t}_x$, $\hat{t}_y$, $\hat{t}_w$, $\hat{t}_h$, $\hat{y}_o$, $\hat{y}_1$, $…$, $\hat{y}_C$, where $C$ is the total number of classes, $y \in \{0,1\}$, and $BCE$ denotes binary cross-entropy.

$$ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{i,j} \big[ (t_x - \hat{t}_x)^2 + (t_y - \hat{t}_y)^2 + (t_w - \hat{t}_w)^2 + (t_h - \hat{t}_h)^2 \big] \\
+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{i,j} \big[ -\log(\sigma(t_o)) + \sum_{k=1}^{C} BCE(\hat{y}_k, \sigma(s_k)) \big] \\
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{noobj}_{i,j} \big[ -\log(1-\sigma(t_o)) \big] $$
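For concreteness, here is how I would compute this loss for a single anchor in a single grid cell. This is only a sketch of my own proposed formula above, not anything from the paper; the function name `cell_loss` and the $\lambda$ defaults (5.0 and 0.5, taken from YOLOv1) are my assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(y, p):
    # Binary cross-entropy for a single target y in {0, 1} and probability p.
    eps = 1e-7
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cell_loss(pred, label, obj_mask, noobj_mask,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Loss for one anchor in one grid cell (hypothetical helper).

    pred  = (tx, ty, tw, th, to, s1, ..., sC)  raw network outputs
    label = (tx_hat, ty_hat, tw_hat, th_hat, yo_hat, y1_hat, ..., yC_hat)
    obj_mask / noobj_mask play the roles of the indicators 1^obj / 1^noobj.
    """
    tx, ty, tw, th, to, *scores = pred
    ltx, lty, ltw, lth, lyo, *classes = label  # lyo is implied by obj_mask
    loss = 0.0
    if obj_mask:
        # coordinate loss on the raw t-values, as in the formula above
        loss += lambda_coord * ((tx - ltx)**2 + (ty - lty)**2
                                + (tw - ltw)**2 + (th - lth)**2)
        # objectness BCE with target 1, plus independent per-class BCE
        loss += bce(1.0, sigmoid(to))
        loss += sum(bce(y, sigmoid(s)) for y, s in zip(classes, scores))
    elif noobj_mask:
        # "ignored" boxes have both masks False and contribute nothing
        loss += lambda_noobj * bce(0.0, sigmoid(to))
    return loss
```

An anchor with both masks `False` (the "ignored" case from the quote) contributes exactly zero.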

Best Answer

If you read the R-CNN paper it references (reference [17] in the quote), it clarifies things a bit.

  1. First, the predicted box with the highest IoU is assigned to the ground truth. That leaves all the other predicted boxes that did not have the highest IoU but do have an IoU over 0.5 with the object. These boxes are not assigned to a ground truth and, as I understand it, are excluded from the loss function entirely (i.e. also from the last, no-object term). Only predicted boxes with an IoU of less than 0.5 with every object count towards the no-object loss. This is somewhat confusing because the approach has changed over the different iterations of YOLO. In the first YOLO there was no such threshold and these predicted boxes were included in the loss function; this was justified by the idea that each box specialises in detecting one object. That approach doesn't seem to be used in YOLOv3, perhaps because with the large increase in the number of boxes predicted compared to YOLOv1 and YOLOv2, this specialisation is no longer needed.

  2. I can't say exactly, as the loss function was never explicitly given in YOLOv3, but I think yours is almost correct. The one thing to add is that YOLOv3 detects at three scales, so the loss function you have only sums over one of them. In YOLOv3 $S = 13, 26, 52$ across the three scales (for a 416×416 input).
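The assignment rule from point 1 can be sketched as follows. This is my reading of the paper, not code from it; `iou_wh` and `assign_anchors` are hypothetical names, and comparing boxes by width/height alone (both centred at the origin) is the usual way anchors are matched to ground truth in YOLO-style training:

```python
def iou_wh(a, b):
    # IoU of two boxes compared by width/height only, both centred at the
    # origin: intersection is min-width * min-height.
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def assign_anchors(anchors, gt_wh, ignore_thresh=0.5):
    """Label each anchor as 'positive', 'ignore', or 'negative'
    with respect to one ground-truth box (hypothetical helper)."""
    ious = [iou_wh(a, gt_wh) for a in anchors]
    best = max(range(len(anchors)), key=lambda i: ious[i])
    labels = []
    for i, iou in enumerate(ious):
        if i == best:
            labels.append('positive')   # responsible for this object
        elif iou > ignore_thresh:
            labels.append('ignore')     # no loss at all, following [17]
        else:
            labels.append('negative')   # contributes only no-object loss
    return labels
```

For example, with anchors of size (10, 10), (9, 9) and (2, 2) and a ground-truth box of (10, 10), the first is positive, the second (IoU 0.81 > 0.5) is ignored, and the third is negative.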

P.S. I don't think "bounding box prior" is exactly synonymous with "anchor box" here; I read it as the final prediction after calculating $b_x, b_y, b_h, b_w$, whereas the anchor box is the predefined box determined by running k-means clustering on the ground-truth boxes of the training set. But I may be wrong on that.
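To make that distinction concrete, here is a sketch of how the raw outputs are decoded into a final box using the YOLOv2/v3 parameterisation $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$, where $(p_w, p_h)$ is the anchor. The function name `decode_box` and the assumption that anchors are given in pixels (so only the centre is scaled by the stride) are mine:

```python
import math

def decode_box(t, cell_xy, anchor_wh, stride):
    """Turn raw outputs (tx, ty, tw, th) into a box (bx, by, bw, bh).

    cell_xy   = (cx, cy): column/row offsets of the grid cell
    anchor_wh = (pw, ph): predefined anchor dimensions in pixels
    stride    = input size / grid size (e.g. 416 / 13 = 32)
    """
    tx, ty, tw, th = t
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    # centre: sigmoid keeps the offset inside the cell, then scale to pixels
    bx = (sig(tx) + cell_xy[0]) * stride
    by = (sig(ty) + cell_xy[1]) * stride
    # size: the anchor is rescaled exponentially by tw, th
    bw = anchor_wh[0] * math.exp(tw)
    bh = anchor_wh[1] * math.exp(th)
    return bx, by, bw, bh
```

With all-zero raw outputs the "prior" is the anchor itself, centred in its cell, which is why the two terms are so easy to conflate.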