Solved – YOLOv3 loss function

loss-functions, neural-networks, object-detection, yolo

The original loss function can be seen here and is more or less explained in Yolo Loss function explanation:

\begin{align}
&\lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{obj}(C_i - \hat{C}_i)^2 + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^B \mathbb{1}_{ij}^{noobj}(C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj}\sum_{c \in classes}(p_i(c) - \hat{p}_i(c))^2
\end{align}
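For concreteness, here is a minimal NumPy sketch of these five terms. The tensor layout, argument names, and mask construction are my own assumptions for illustration, not taken from any official implementation:

```python
import numpy as np

def yolo_v1_loss(boxes_pred, boxes_true, conf_pred, conf_true,
                 cls_pred, cls_true, obj_mask,
                 lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error loss following the five terms above.

    boxes_*:  (S*S, B, 4) arrays of (x, y, w, h)
    conf_*:   (S*S, B) confidence scores
    cls_*:    (S*S, C) class probabilities (per cell in v1)
    obj_mask: (S*S, B) bool; True where box j of cell i is responsible
              for a ground-truth object (the 1_ij^obj indicator)
    """
    noobj_mask = ~obj_mask
    m = obj_mask[..., None]  # broadcast the indicator over the coordinate axis

    # Coordinate terms: (x, y) directly, (w, h) on square roots.
    xy_err = np.sum(m * (boxes_pred[..., :2] - boxes_true[..., :2]) ** 2)
    wh_err = np.sum(m * (np.sqrt(boxes_pred[..., 2:]) -
                         np.sqrt(boxes_true[..., 2:])) ** 2)

    # Confidence terms, split between object and no-object boxes.
    conf_sq = (conf_pred - conf_true) ** 2
    obj_conf = np.sum(obj_mask * conf_sq)
    noobj_conf = np.sum(noobj_mask * conf_sq)

    # Class term uses the per-cell indicator 1_i^obj.
    cell_has_obj = obj_mask.any(axis=1, keepdims=True)
    cls_err = np.sum(cell_has_obj * (cls_pred - cls_true) ** 2)

    return (lambda_coord * (xy_err + wh_err)
            + obj_conf + lambda_noobj * noobj_conf + cls_err)
```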

However, I still have a few questions about the above equation, and about how (or whether) the loss changed in YOLOv3. For starters, in YOLOv3 the output is $S\times S\times B\times (4+1+C)$ as opposed to $S\times S\times (B\times (4+1)+C)$, meaning that the indicator in the last (class) term would become $\mathbb{1}^{obj}_{ij}$ rather than $\mathbb{1}^{obj}_{i}$.
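A quick shape check makes that difference concrete (the values of $S$, $B$, $C$ below are arbitrary examples):

```python
import numpy as np

S, B, C = 13, 3, 80  # arbitrary example values

# YOLOv3: class scores are predicted per box.
v3_out = np.zeros((S, S, B, 4 + 1 + C))
# YOLOv1: class probabilities are shared per cell.
v1_out = np.zeros((S, S, B * (4 + 1) + C))

print(v3_out.shape)  # (13, 13, 3, 85)
print(v1_out.shape)  # (13, 13, 95)
```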

For the rest of the question consider the first 5 terms of the last dimension of YOLOv3 output to be:
\begin{align}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w\exp(t_w) \\
b_h &= p_h\exp(t_h) \\
\Pr(\text{object}) &= \sigma(t_o)
\end{align}

See here for a full explanation of these equations; in short, they map the raw network outputs $t_*$ to the box centre coordinates, the width and height, and the probability that an object is present in the cell.
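As a sanity check, the decoding is only a few lines of NumPy; `cell_xy` and `prior_wh` are my own names for $(c_x, c_y)$ and $(p_w, p_h)$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t, cell_xy, prior_wh):
    """Map raw outputs t = (t_x, t_y, t_w, t_h, t_o) to a box + objectness.

    cell_xy:  (c_x, c_y), offset of the cell from the image top-left
    prior_wh: (p_w, p_h), width and height of the anchor (prior)
    """
    t_x, t_y, t_w, t_h, t_o = t
    b_x = sigmoid(t_x) + cell_xy[0]
    b_y = sigmoid(t_y) + cell_xy[1]
    b_w = prior_wh[0] * np.exp(t_w)
    b_h = prior_wh[1] * np.exp(t_h)
    p_obj = sigmoid(t_o)  # Pr(object)
    return b_x, b_y, b_w, b_h, p_obj
```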

  • My biggest question is with regard to what $C_i$ is in the 3rd and 4th terms. Suppose that $\hat{C}_i$ is the estimate and $C_i$ is the true value. According to the original (v1) paper, "we define confidence as $Pr(object)\times IOU^{truth}_{pred}$". So there are three ways the 3rd loss term can be interpreted:

    1. $C_i=1$ and $\hat{C}_i = \sigma(t_o) \times IOU$, where the IOU depends on the bounding box described by the first four equations.
    2. According to this code and other implementations I've seen, it seems that $C_i=IOU$ and $\hat{C}_i = \sigma(t_o)$.
    3. Finally, the most intuitive to me is that $C_i=1$ and $\hat{C}_i = \sigma(t_o)$. Why bother introducing the IOU here at all, when the first two terms already push the predicted box towards maximal IOU?
  • Finally, the YOLOv3 paper states that:

    "During training we use sum of squared error loss. If the ground truth for some coordinate prediction is $\hat{t}_*$ our gradient is the ground truth value (computed from the ground truth box) minus our prediction: $\hat{t}_* − t_*$. This ground truth value can be easily computed by inverting the equations above".

    So does this mean I can ignore the first two terms of the loss function above and replace them with squared error on the raw $t_*$ values?

Would appreciate any input on this, even if it only answers one of the two questions.

Best Answer

Good questions. For the first question, the confidence-score definitions differ between YOLOv1 and YOLOv3. According to the YOLOv1 paper:

"These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts."

"If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth."

This is the same as your second interpretation. However, in the YOLOv3 paper:

"YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior."

This is the same as your third interpretation. In fact, both options are implemented in the code you referred to, and I suspect both work in practice.
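To make the two options concrete, here is a minimal sketch of how the objectness target $C_i$ could be built either way. The array names and the matching logic are my own simplifications, not taken from the linked code:

```python
import numpy as np

def objectness_target(iou_with_gt, responsible_mask, use_iou_target=True):
    """Build the target C_i for the confidence loss under either reading.

    iou_with_gt:      (N,) IOU of each predicted box with its matched
                      ground-truth box (0 where there is no object)
    responsible_mask: (N,) bool, the 1_ij^obj indicator
    use_iou_target:   True  -> interpretation 2 (v1-style): C_i = IOU
                      False -> interpretation 3 (v3-style): C_i = 1
    """
    if use_iou_target:
        return np.where(responsible_mask, iou_with_gt, 0.0)
    return responsible_mask.astype(float)
```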

For the second question: yes, just don't forget to apply the inverse functions to the ground-truth coordinates first. Source code for reference.
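A minimal sketch of that inversion, assuming the same symbols as in the question ($c_x, c_y$ the cell offset, $p_w, p_h$ the prior); the helper names here are hypothetical:

```python
import numpy as np

def encode_box(b, cell_xy, prior_wh, eps=1e-9):
    """Invert the decoding equations: ground-truth box -> targets t_*.

    b:        (b_x, b_y, b_w, b_h) of the ground-truth box
    cell_xy:  (c_x, c_y), offset of the cell containing the box centre
    prior_wh: (p_w, p_h) of the matched anchor
    """
    def logit(p):  # inverse of the sigmoid, clipped away from 0 and 1
        p = np.clip(p, eps, 1.0 - eps)
        return np.log(p / (1.0 - p))

    t_x = logit(b[0] - cell_xy[0])    # since sigma(t_x) = b_x - c_x
    t_y = logit(b[1] - cell_xy[1])
    t_w = np.log(b[2] / prior_wh[0])  # since exp(t_w) = b_w / p_w
    t_h = np.log(b[3] / prior_wh[1])
    return t_x, t_y, t_w, t_h
```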