Solved – Understanding the YOLO algorithm

conv-neural-network, machine learning, neural networks, object detection, yolo

I am trying to understand the logic behind object detection and the YOLO algorithm. I have read numerous blogs, tutorials, videos, and papers, yet I am still not sure I have understood it correctly. Let's assume our network can predict three classes (cat, dog, bird) and we want to have $10 \times 10$ boxes.

1) We create a CNN without FC layers; instead we use a $1 \times 1$ convolution layer.

2) We train it to take a $10 \times 10$ image as input and output a tensor of shape $1 \times 1 \times 5 \times 8$, where 5 is the number of anchor boxes and 8 is the number of features (the features are a confidence score measuring how sure the CNN is that something is there; x, y, width, and height; and indicator variables for the classes). In training, we say the object belongs to the cell of the $10 \times 10$ grid that contains its center.

Questions: Do we put anchor boxes there, or do we just train the CNN to output arbitrary boxes there? If we do the latter, how do we label them while training? For example, if there is one small bird, do we divide it across these boxes, or what's the trick? How do we say the object belongs to some of the $10 \times 10$ boxes? Also, wouldn't this mean that all of the $10 \times 10$ boxes that do not contain a center would get a confidence score of zero?

3) We input our image into our CNN (which takes a $10 \times 10$ image). Thanks to convolution, the output will be $K \times K$, where each of the $K \times K$ positions is actually our vector of features.

4) We remove all boxes that have confidence scores $< 0.6$.

5) We use non-max suppression to retrieve our boxes.

Question: Is this correct? Have I made any mistakes?

Best Answer

You have sort of lost me from the start. There are two notions of "boxes": one is the set of "grid cells" into which the image is split from the start, while the second is the set of bounding boxes which actually implement the detection of the objects. I will use the word "box" to refer to the latter. I assume you meant the grid cells to be 10 by 10, since the bounding boxes do not have a fixed width or height.

(1) The pretraining is done with FC layers that are later discarded in favour of newly trained ones, but in all cases FC layers are present. See page 3 of the paper.

(2) I am not sure what you mean here, but it sounds like you think the network operates on one grid cell at a time. This is not the case, e.g. see page 2 of the paper:

Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance.

In essence, the network operates on the whole image simultaneously. More specifically, given an image cut into $S\times S$ grid cells, we output $BS^2$ bounding boxes (i.e. $B$ boxes per grid cell, where $B$ is a fixed constant), a total which is constant across images (meaning the size of the FC layers can be constant). The boxes are not confined to their grid cell; rather, as noted on page 2:

The width and height are predicted relative to the whole image.

In essence, the centres of the output boxes (per grid cell) are confined to that grid cell, but the boxes themselves can grow outside that boundary.
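To make that parametrisation concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) that decodes per-cell predictions into whole-image coordinates; the function name `decode_boxes` and the `(x, y, w, h, confidence)` layout within the last axis are assumptions made for the example:

```python
import numpy as np

def decode_boxes(pred, B):
    """Turn per-cell predictions into whole-image coordinates.

    pred: array of shape (S, S, 5*B) holding, for each of the B boxes
          of a cell, (x, y, w, h, confidence); x and y are offsets
          within the cell, w and h are fractions of the whole image.
    Returns an (S*S*B, 5) array of (cx, cy, w, h, confidence) rows,
    all normalised to [0, 1] over the whole image.
    """
    S = pred.shape[0]
    rows = []
    for i in range(S):
        for j in range(S):
            for b in range(B):
                x, y, w, h, conf = pred[i, j, 5 * b:5 * b + 5]
                cx = (j + x) / S   # centre is confined to cell (i, j)...
                cy = (i + y) / S
                rows.append((cx, cy, w, h, conf))  # ...but w, h may exceed 1/S
    return np.array(rows)
```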

So, regarding your question "while training, how do we label them?", we do two things: (a) the object should be detected by the specific grid cell in which its centre resides, meaning boxes centred in other grid cells will ignore it (this is implemented in the loss function's masks); (b) the "label" consists of the properties of the ground-truth object's box and its class, which are learned by minimizing several losses: (i) matching the coordinates of the centre and the size of the box, (ii) being confident that an object exists only if one is actually present (and vice versa if one is not), and (iii) a classification loss for which object is actually there. Regarding the loss function, check out these two questions: [1] and [2].
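If it helps, here is a rough NumPy sketch of that loss structure, in the spirit of the paper's sum-squared-error formulation; the tensor layouts, the masks, and the name `yolo_v1_loss` are my own assumptions, while $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$ are the weights reported in the paper:

```python
import numpy as np

def yolo_v1_loss(pred_boxes, pred_cls, true_boxes, true_cls,
                 obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of the YOLO v1 sum-squared-error loss.

    pred_boxes, true_boxes: (S, S, B, 5) as (x, y, w, h, confidence);
                            true confidence is the IoU with the ground
                            truth where a box is responsible, 0 elsewhere.
    pred_cls, true_cls:     (S, S, C) class probabilities per cell.
    obj_mask:               (S, S, B), 1 where box b of cell (i, j) is
                            "responsible" (best IoU in a cell with an object).
    """
    noobj_mask = 1.0 - obj_mask
    cell_has_obj = obj_mask.max(axis=-1)      # (S, S): 1 if the cell has an object

    # (i) localisation: centre offsets plus square-rooted width/height
    xy_loss = np.sum(obj_mask[..., None] *
                     (pred_boxes[..., 0:2] - true_boxes[..., 0:2]) ** 2)
    wh_loss = np.sum(obj_mask[..., None] *
                     (np.sqrt(pred_boxes[..., 2:4]) -
                      np.sqrt(true_boxes[..., 2:4])) ** 2)

    # (ii) confidence: pushed up where an object is, down (weakly) elsewhere
    conf_err = (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2
    conf_loss = (np.sum(obj_mask * conf_err) +
                 lambda_noobj * np.sum(noobj_mask * conf_err))

    # (iii) classification, only for cells that contain an object
    cls_loss = np.sum(cell_has_obj[..., None] * (pred_cls - true_cls) ** 2)

    return lambda_coord * (xy_loss + wh_loss) + conf_loss + cls_loss
```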

With respect to your question "If there is, for example, one small bird, do we divide him into these boxes, or what's the trick?", here's what will happen in various cases. If she is entirely in one grid cell, we will try to produce one box centred on the bird that exactly encapsulates it. If she crosses between two grid cells (i.e. lies on a boundary), we will try to produce a box whose centre lies in the grid cell containing hers, but the box itself spills outside that cell. We prevent multiple boxes per object by (I) only having the grid cell containing the target's centre predict it, (II) during training only having the predicted box with the best overlap be considered in the loss (i.e. the training itself encourages specialization of local boxes), and (III) non-maximum suppression (especially useful when she is near a boundary).
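Both (II) and (III) hinge on measuring overlap as intersection over union (IoU). A small self-contained sketch, with the box layout again assumed to be `(cx, cy, w, h)`:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    # Convert centre/size form to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Overlap is the area of the intersection rectangle (clamped at 0).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```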

Notice that because $B$ is constant, we can run into issues if there are many little birds in one grid cell; as the authors note (page 4):

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

(3) Again, the whole image is processed by the network at once. In other words, the entire predictor can be viewed as a single function $$f : \mathcal{I} \rightarrow \mathbb{R}^{S\times S\times(5B+|C|)} $$ where $\mathcal{I}$ is the space of RGB images and $C$ is the set of categories considered by the classifier.
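With the paper's settings ($S = 7$, $B = 2$, and the $|C| = 20$ PASCAL VOC classes), this makes the output a $7 \times 7 \times 30$ tensor. A quick check:

```python
import numpy as np

S, B, C = 7, 2, 20        # the values used in the paper (PASCAL VOC)
depth = 5 * B + C         # 5 numbers per box, plus one score per class
print(depth)              # 30: f maps an image to a 7 x 7 x 30 tensor

output = np.zeros(S * S * depth)     # e.g. the flat final-layer output...
grid = output.reshape(S, S, depth)   # ...viewed as the S x S grid
```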

(4-5) Seems fine.
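For completeness, here is a sketch of those two steps, reusing the `iou` helper above. The $0.6$ cut-off comes from your step (4); the $0.5$ IoU threshold and the greedy highest-confidence-first loop are the usual formulation rather than anything prescribed by the paper (which, strictly speaking, runs suppression per class on class-specific scores):

```python
def threshold_and_nms(boxes, conf_thresh=0.6, iou_thresh=0.5):
    """Greedy non-max suppression over rows of (cx, cy, w, h, confidence)."""
    # Step (4): drop everything below the confidence cut-off.
    candidates = [b for b in boxes if b[4] >= conf_thresh]
    candidates.sort(key=lambda b: b[4], reverse=True)  # most confident first
    kept = []
    # Step (5): keep the best box, suppress its strong overlaps, repeat.
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [b for b in candidates if iou(best, b) < iou_thresh]
    return kept
```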

Hopefully that helps.