State-of-the-art object detection networks, such as RetinaNet, Faster R-CNN, and YOLO, use a coordinate encoding in which the bounding box regression targets are expressed relative to an anchor box:
Centers:
$t_x = (x-x_a)/w_a$ and $t_y = (y-y_a)/h_a$
Height and width offsets:
$t_w = \log(w/w_a)$ and $t_h = \log(h/h_a)$
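As a concrete sketch of these formulas (in Python, with hypothetical helper names; boxes are assumed to be given as center coordinates plus width/height):

```python
import math

def encode(box, anchor):
    """Compute regression targets (t_x, t_y, t_w, t_h) for a ground-truth
    box (x, y, w, h) relative to an anchor (x_a, y_a, w_a, h_a)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    tx = (x - xa) / wa     # center offsets, normalized by anchor size
    ty = (y - ya) / ha
    tw = math.log(w / wa)  # log-ratio width/height offsets
    th = math.log(h / ha)
    return tx, ty, tw, th

# e.g. a box 20% wider than its anchor, same center:
# encode((50, 50, 120, 100), (50, 50, 100, 100)) -> (0.0, 0.0, log(1.2), 0.0)
```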
Why are the width and height predictions in logarithmic format? Is there an optimization reason for this?
Best Answer
The parametrization seems to originate from the R-CNN paper, Girshick et al., 2013: Rich feature hierarchies for accurate object detection and semantic segmentation. Note that SSD also uses this parametrization (see Eq. (2) in the paper).
Using this parametrization, the size of a bounding box is computed as $w=w_a\exp(t)$, where $w_a$ is the size of the anchor box and $t$ is the network output. This parametrization has some (nice) properties:

- The decoded size is always positive, since $\exp(t)>0$ for any real $t$, so the network can never predict an invalid (zero- or negative-sized) box.
- The target is scale-invariant: $t=\log(w/w_a)$ measures a relative change, so the same offset means the same percentage resize for small and large anchors, keeping the regression targets in a similar numeric range across scales.
- Multiplicative scaling becomes additive in log space, which pairs naturally with an unbounded linear regression output.
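The inverse (decoding) step, as a minimal sketch assuming the same center/size box convention, shows the positivity guarantee directly:

```python
import math

def decode(targets, anchor):
    """Invert the R-CNN encoding: recover (x, y, w, h) from network
    outputs (t_x, t_y, t_w, t_h) and an anchor (x_a, y_a, w_a, h_a)."""
    tx, ty, tw, th = targets
    xa, ya, wa, ha = anchor
    return (xa + tx * wa,
            ya + ty * ha,
            wa * math.exp(tw),  # exp(t) > 0, so the width is always positive
            ha * math.exp(th))

anchor = (50.0, 50.0, 100.0, 100.0)
# Even a very negative t_w decodes to a tiny but strictly positive width:
x, y, w, h = decode((0.0, 0.0, -10.0, 0.0), anchor)
assert w > 0
```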
The first property is very useful. It is hard to say if, or how much, the rest makes optimization easier, but it seems to work nicely: this is the de facto standard parametrization used in object detection.