What purpose do the grid cells serve in the YOLO object detection algorithm?

conv-neural-network, neural networks, object detection, yolo

I was looking at YOLO and have read several blogs online, but one concept I'm having trouble understanding is why we divide the image into a grid and then predict bounding box outputs for every cell.

Also, if an object extends outside a grid cell, how does that cell predict the object? I thought that at each grid cell we basically do the 'convolutional implementation of sliding windows' explained in Andrew Ng's Deep Learning course on Coursera, which basically means the network looks at a particular grid cell and localizes where the object might be. So if the object is outside the grid cell, how is that possible (since the network only looks at things inside the grid cell)? I'm confused about the whole thing… if anyone can explain in simple terms I'd greatly appreciate it!

Thanks!

Best Answer

"Why do we want to divide the image into different grids, and then predict the bounding box outputs for all of these cells?"

What's the alternative? Perhaps the network could just output $N(5+C)$ units in a fully connected layer, where $N$ is the maximum number of boxes possible, $C$ is the number of categories, and the 5 covers the four box coordinates plus a confidence score.

But this leaves the network very "unconstrained": perhaps after training, the first $5+C$ units will converge to specializing in detecting pedestrians, while the last $5+C$ will specialize in cars. Or perhaps the first $5+C$ will specialize in detecting objects in the left half of the image, etc. Or maybe all the groups of $5+C$ units will converge to detecting only bikes, and all other objects in the image will unfortunately be neglected!

The reason you need a grid is to induce a bias which says "these output units here are responsible for detecting objects in/on/covering exactly this region of the image".
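One common way to realize this bias, and the one the original YOLO paper describes, is to make the cell that contains an object's center responsible for predicting it. Below is a minimal sketch of that assignment. The function names `responsible_cell` and `encode_targets` are made up for illustration, and the sketch assumes one box per cell and boxes given as (center x, center y, width, height, class index) in pixels; real implementations differ in the details.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size):
    """Return (row, col) of the grid cell that contains the box center."""
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col


def encode_targets(boxes, img_w, img_h, grid_size, num_classes):
    """Build a grid_size x grid_size x (5 + C) target as nested lists.

    Assumption for this sketch: at most one box per cell.
    Each cell holds [x_offset, y_offset, w, h, confidence, class one-hot...].
    """
    depth = 5 + num_classes
    target = [[[0.0] * depth for _ in range(grid_size)]
              for _ in range(grid_size)]
    cell_w, cell_h = img_w / grid_size, img_h / grid_size
    for cx, cy, w, h, cls in boxes:
        row, col = responsible_cell(cx, cy, img_w, img_h, grid_size)
        vec = target[row][col]
        # Center is stored relative to the cell's top-left corner, in cell units.
        vec[0] = cx / cell_w - col
        vec[1] = cy / cell_h - row
        # Width and height are relative to the whole image, so a box may
        # be far larger than the cell that is responsible for it.
        vec[2] = w / img_w
        vec[3] = h / img_h
        vec[4] = 1.0            # an object is present in this cell
        vec[5 + cls] = 1.0      # one-hot class label
    return target


# A 300-pixel-wide object in a 448x448 image with a 7x7 grid (64-pixel cells).
target = encode_targets([(100, 300, 300, 80, 2)], 448, 448, 7, 3)
print(responsible_cell(100, 300, 448, 448, 7))
print(target[4][1])
```

Note that only the center decides which cell owns the object; the predicted width and height are free to extend well past the cell's borders.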

"Also, if an object is outside the grid, then how does this grid predict that object?"

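See this question about the receptive field of "neurons" in a convolutional network. In short, a unit deep in the network usually has a receptive field much larger than a single grid cell, so the prediction for a cell can draw on image content outside that cell's boundaries. In the original YOLO the last layers are fully connected, so each cell's outputs can depend on the entire image.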
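To make the receptive-field point concrete, here is a rough sketch that computes the receptive field of one output unit for a stack of convolution and pooling layers, using the standard recurrence (each layer grows the field by (kernel - 1) times the current stride product). The layer stack is a toy assumption chosen for illustration, not YOLO's actual architecture, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Return (receptive field size, effective stride) in input pixels.

    layers: list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # field grows by (k - 1) input-space steps
        jump *= stride              # distance between adjacent units, in input pixels
    return rf, jump


# Toy stack (an assumption): five repetitions of 3x3 conv (stride 1)
# followed by 2x2 max pooling (stride 2).
toy_stack = [(3, 1), (2, 2)] * 5
rf, jump = receptive_field(toy_stack)
print(f"each output unit maps to a {jump}-pixel cell but sees {rf} pixels")
```

Even in this small example the region a unit sees is roughly three times wider than the cell it corresponds to, and deeper stacks widen the gap further.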