machine-learning – Understanding the Receptive Field of a Stack of Dilated Convolutions

convolution, deep learning, machine learning, neural networks

I'm reading the Wavenet paper which says:

Stacked dilated convolutions enable networks to have very large
receptive fields with just a few layers, while preserving the input
resolution throughout the network as well as computational
efficiency. In this paper, the dilation is doubled for every layer up
to a limit and then repeated: e.g.,

1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512.

… Each 1, 2, 4, …, 512 block has a receptive field size of 1024 … Stacking these blocks further increases the model capacity and the receptive field size.

I'm trying to understand what the receptive field is after the Nth block of 1, 2, 4, …, 512. After the second block, is it 1024*1024, and after the third block is it 1024^3? Or am I misunderstanding how the receptive field size expands?

Best Answer

I think it should be roughly 1024*3: the receptive field grows additively with each block, not multiplicatively.

After the first block, the receptive fields of the outputs cover input indices 1-1024, 2-1025, 3-1026, etc. (assuming no padding, but the receptive field size should be the same with padding anyway). When you stack the second block, which itself has a receptive field of 1024, its first output "sees" the first 1024 outputs of the previous block, i.e. the ones covering indices 1-1024, 2-1025, ..., 1024-2047. So its receptive field covers 1-2047. Each additional block therefore adds roughly another 1024 (exactly 1023) to the overall receptive field size, as the sketch below illustrates.
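Here is a minimal sketch of that argument (the function name and dilation schedule are my own illustration, not code from the paper), assuming kernel size 2 and the 1, 2, 4, …, 512 schedule:

```python
def receptive_field(num_blocks, kernel_size=2,
                    dilations=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512)):
    """Receptive field (in input samples) after stacking `num_blocks`
    blocks of dilated convolutions."""
    size = 1  # a single output initially "sees" one input sample
    for _ in range(num_blocks):
        for d in dilations:
            # each layer widens the field by (kernel_size - 1) * dilation
            size += (kernel_size - 1) * d
    return size

for n in (1, 2, 3):
    print(n, receptive_field(n))
# 1 -> 1024, 2 -> 2047, 3 -> 3070: additive growth, roughly 1024 per block
```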

In general, I think the recursion for the receptive field size $s_{l_i}$ of layer $l_i$ should be:

$s_{l_0} = 1$

$s_{l_i} = s_{l_{i-1}} + (\text{kernel size} - 1) \cdot \text{dilation}_i$

If this is correct, their kernel size seems to be 2 (to arrive at a receptive field size of 1024), which is a bit surprising; I hope that is not due to some fault in my logic :)
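As a quick check (my own arithmetic, assuming kernel size 2): applying the recursion across one 1, 2, 4, …, 512 block gives

$s = 1 + (2 - 1)\cdot(1 + 2 + 4 + \dots + 512) = 1 + 1023 = 1024,$

which matches the receptive field size of 1024 quoted from the paper.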

Stacking the blocks might also be useful more for refining outputs at a finer-grained level after the previous block has already processed a large receptive field, rather than just for maximally increasing the receptive field size.