Solved – Overlap-tile strategy in U-Nets

computer-vision, conv-neural-network, image-segmentation, neural-networks

I was reading the U-Net paper and it mentions an "overlap-tile strategy" that I am not quite familiar with. Here is the paragraph from the paper where it is introduced:

Overlap-tile strategy in U-Nets

What do they mean by "only us[ing] the valid part of each convolution"? I looked this up, and from what I have understood, I think they mean that their convolution operations do not involve any padding. Instead of using padded convolutions to maintain the spatial size of the feature maps, they pad the original image by mirroring its borders and forward the pre-padded image through the network; the feature maps shrink at every convolution, and the final output ends up with the same spatial dimensions as the original image. If this strategy really is better than padded convolutions, why is it not used everywhere? If not, what is it that makes it worse?
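To make the mirror-padding idea concrete, here is a small sketch of my own (not code from the paper) using NumPy's `np.pad` with `mode="reflect"`. Each valid k×k convolution shrinks every spatial dimension by k−1, so pre-padding by half the total shrinkage makes the final output the same size as the original image:

```python
import numpy as np

# Hypothetical sketch: mirror-pad an image so that a stack of "valid"
# (unpadded) 3x3 convolutions returns an output the same size as the
# original. Each valid 3x3 convolution shrinks each spatial dimension
# by 2, so n_convs of them shrink it by 2 * n_convs in total.
def mirror_pad_for_valid_convs(image, n_convs, kernel_size=3):
    shrink = (kernel_size - 1) * n_convs  # total shrinkage per dimension
    pad = shrink // 2                     # pad symmetrically on each side
    return np.pad(image, pad_width=pad, mode="reflect")

img = np.arange(16, dtype=float).reshape(4, 4)
padded = mirror_pad_for_valid_convs(img, n_convs=2)
print(padded.shape)  # (8, 8): 4 + 2 on each side, per dimension
```

The original image sits untouched in the centre of the padded array; only the border context is synthesized by mirroring.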

Also, what is the overlap-tile strategy? And how does it allow the "seamless segmentation of arbitrarily large images"? The Figure 2 they refer to is below, but I am finding it difficult to see what the figure is trying to depict.

overlap-tile strategy depiction

Best Answer

On the "overlap-tile strategy" specifically:

The blue box in Fig 2 (left) shows the input to the network. Because they're using valid convolutions, the output is the smaller yellow box (right). Sounds like you understand this part already.
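To see the shrinkage the blue-box/yellow-box pair is illustrating, here is a minimal hand-rolled "valid" 2D convolution (my own illustration, not the network from the paper): the kernel is only applied where it fits entirely inside the input, so a k×k kernel removes a (k−1)/2-pixel border on each side.

```python
import numpy as np

# Minimal sketch of a "valid" (unpadded) 2D convolution: the kernel is
# only applied where it fits entirely inside the input, so a k x k
# kernel shrinks each spatial dimension by k - 1.
def valid_conv2d(x, kernel):
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

x = np.ones((10, 10))
k = np.ones((3, 3))
y = valid_conv2d(x, k)
print(y.shape)  # (8, 8): one valid 3x3 conv removes a 1-pixel border
```

Stack many of these (as U-Net does) and the output (yellow) ends up noticeably smaller than the input (blue).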

They're trying to show that the image that they want to predict on is bigger than the input to the network (e.g. perhaps the GPU memory is not big enough to hold the whole thing). So they have to run inference several times using different subsets of the input.

On the right side, imagine shifting the yellow box down so that the two squares are right next to each other (bottom side of original square touches top of shifted square). Do that a bunch of times to "tile" your output space. Now, you need a bigger region of the input (blue) for inference. For non-overlapping yellow boxes (in the output) you will need overlapping blue boxes (for the input).
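The tiling loop described above can be sketched as follows. This is a hypothetical illustration under assumed names (`predict_tiled`, `fake_net` are mine, not from the paper): output tiles do not overlap, but each one is predicted from a larger input window, so neighbouring input windows overlap; context missing at the image border is supplied by mirroring.

```python
import numpy as np

# Hypothetical overlap-tile sketch: predict a large image in
# non-overlapping output tiles, where each tile's prediction needs a
# larger (overlapping) input window. `context` is the border the
# network consumes (half its total shrinkage); missing context at the
# image edge is supplied by mirroring, as in the U-Net paper.
def predict_tiled(image, predict_fn, tile=4, context=2):
    padded = np.pad(image, context, mode="reflect")
    out = np.zeros_like(image)
    for i in range(0, image.shape[0], tile):
        for j in range(0, image.shape[1], tile):
            # Input window is tile + 2*context wide, so neighbouring
            # windows overlap by 2*context pixels.
            window = padded[i:i + tile + 2 * context,
                            j:j + tile + 2 * context]
            out[i:i + tile, j:j + tile] = predict_fn(window)
    return out

# Stand-in "network": a valid op that just crops away 2*context border,
# i.e. an identity with U-Net-style shrinkage.
def fake_net(window, context=2):
    return window[context:-context, context:-context]

img = np.arange(64, dtype=float).reshape(8, 8)
result = predict_tiled(img, fake_net)
print(np.array_equal(result, img))  # True: tiles reassemble seamlessly
```

Because every output pixel was predicted from its full input context (real or mirrored), the tile seams carry no border artifacts, which is what makes the segmentation of arbitrarily large images "seamless".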

(If it's still not clear, I can try drawing a picture.)