Solved – How does the YOLO network create boundaries for object detection

I'm not sure I understand how the YOLO network works. From the description at https://pjreddie.com/darknet/yolo/, it appears to me that everything is done with convolutions only. You end up with an NxNxM output, where each M-long vector holds a couple of bounding boxes, class scores, etc. As I understand it, each of those vectors contains 4 values per box: the center position (predicted by that grid cell only if the center lies inside it), then the width and the height of the bounding box. So it seems that the bounding box is "around" the cell.

See the explanation at https://medium.com/diaryofawannapreneur/yolo-you-only-look-once-for-object-detection-explained-6f80ea7aaa1e

But each cell is an aggregation of the feature cells beneath that very same cell, so how can it encode a bounding box whose size extends beyond the cell?

Best Answer

No, the original YOLO doesn't use only convolutions. As you can see in the architecture diagram, there are two fully-connected layers between the main convolutional part and the final NxNxM output.

These fully-connected layers let it essentially do regression on the bounding box center coordinates as well as the width and height, which can range over the whole image. Please let me know if I haven't understood your question correctly!
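To make this concrete, here is a minimal NumPy sketch, not the actual Darknet code. It assumes the YOLOv1 paper's parametrization (S = 7 grid, B = 2 boxes, C = 20 classes, center offsets relative to the cell, width and height relative to the whole image). The layer sizes, random weights, plain ReLU, and the helper name `decode_box` are made up for illustration.

```python
import numpy as np

S, B, C = 7, 2, 20            # grid size, boxes per cell, classes (YOLOv1 paper values)
rng = np.random.default_rng(0)

# Toy stand-in for the last convolutional feature map (channel count is hypothetical).
conv_features = rng.standard_normal((S, S, 8))

# Two fully-connected layers with random toy weights (sizes are hypothetical).
W1 = rng.standard_normal((conv_features.size, 64)) * 0.01
W2 = rng.standard_normal((64, S * S * (B * 5 + C))) * 0.01

def head(features):
    """Flatten the feature map, apply two dense layers, reshape to S x S x M."""
    flat = features.reshape(-1)
    hidden = np.maximum(flat @ W1, 0.0)   # plain ReLU for brevity
    return (hidden @ W2).reshape(S, S, B * 5 + C)

output = head(conv_features)

# Because of the flattening, every output cell depends on every input location:
# perturbing the top-left feature changes the prediction of the center cell.
perturbed = conv_features.copy()
perturbed[0, 0, 0] += 1.0
center_cell_change = np.abs(head(perturbed)[3, 3] - output[3, 3]).max()

def decode_box(row, col, x, y, w, h, img_w, img_h, grid=S):
    """Convert one cell-relative prediction to pixel corners (x1, y1, x2, y2).

    x, y: center offset inside the cell, in [0, 1].
    w, h: box size as a fraction of the whole image, in [0, 1].
    """
    cx = (col + x) / grid * img_w
    cy = (row + y) / grid * img_h
    bw, bh = w * img_w, h * img_h
    return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2

# A box centered in cell (3, 3) of a 448x448 image, covering 80% x 60% of the image.
box = decode_box(3, 3, 0.5, 0.5, 0.8, 0.6, 448, 448)
```

In this sketch the center cell spans pixels 192 to 256 in each direction, yet the decoded box is far wider than that: only the center is tied to the cell, while the width and height are expressed relative to the full image.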
