Coordinate prediction parameterization in object detection networks

classification, machine learning, neural networks, object detection

State-of-the-art object detection networks, such as RetinaNet, Faster R-CNN, and YOLO, use a coordinate encoding in which the bounding box regression targets are given relative to the anchor box:

Centers:
$t_x = (x-x_a)/w_a$ and $t_y = (y-y_a)/h_a$

Height and width offsets:
$t_w = \log(w/w_a)$ and $t_h = \log(h/h_a)$
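For concreteness, here is a minimal sketch of this encoding in Python. It assumes boxes are given as center-size tuples `(x, y, w, h)`. The function name `encode_box` and the example numbers are hypothetical and not taken from any of the papers.

```python
import math

def encode_box(box, anchor):
    """Encode a box relative to an anchor; both are (cx, cy, w, h) tuples."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    tx = (x - xa) / wa        # center offset, in units of anchor width
    ty = (y - ya) / ha        # center offset, in units of anchor height
    tw = math.log(w / wa)     # log of the width ratio
    th = math.log(h / ha)     # log of the height ratio
    return tx, ty, tw, th

# Illustrative values only: an anchor and a ground-truth box.
anchor = (50.0, 50.0, 20.0, 40.0)
box = (55.0, 40.0, 40.0, 20.0)
targets = encode_box(box, anchor)
print(targets)
```

Here the box is twice as wide and half as tall as the anchor, so the two size targets come out as $\log 2$ and $-\log 2$.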

Why are the width and height predictions in logarithmic form? Is there an optimization reason for this?

Best Answer

The parametrization seems to originate from the R-CNN paper, Girshick et al., 2013: Rich feature hierarchies for accurate object detection and semantic segmentation. SSD also uses this parametrization (see Eq. (2) in the paper).

Using this parametrization, the size of a bounding box is computed as $w=w_a\exp(t)$, where $w_a$ is the size of the anchor box and $t$ is the network output. This parametrization has some nice properties:

  • The predicted bounding box always has positive size
  • If $t=0$, the predicted box has the same size as the anchor box
  • Values $t<0$ shrink the bounding box "slowly" (a large decrease in the prediction gives a small decrease in size)
  • Values $t>0$ expand the bounding box "fast" (a small increase in the prediction gives a large increase in size)
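
These properties can be checked numerically. Below is a minimal sketch, assuming an anchor width of 100 (an arbitrary value chosen for illustration) and a hypothetical helper `decode_size` that applies $w=w_a\exp(t)$:

```python
import math

def decode_size(t, anchor_size):
    """Map a raw network output t to a box size: w = w_a * exp(t)."""
    return anchor_size * math.exp(t)

w_a = 100.0  # arbitrary anchor width, for illustration only

# Decode a few evenly spaced network outputs around t = 0.
sizes = {t: decode_size(t, w_a) for t in (-2.0, -1.0, 0.0, 1.0, 2.0)}

for t, w in sizes.items():
    print(f"t = {t:+.1f} -> w = {w:.1f}")
```

For $t = -2, -1, 0, 1, 2$ this gives widths of roughly 13.5, 36.8, 100, 271.8, and 738.9. Every size is positive, $t=0$ reproduces the anchor, and equal steps in $t$ are equal multiplicative steps in size. In absolute terms, that makes boxes shrink slowly and grow fast.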

The first property is very useful. It is hard to say whether, or how much, the other properties make the optimization easier, but the parametrization seems to work well, given that it is the de facto standard in object detection.
