Solved – WaveNet is not really a dilated convolution, is it?

conv-neural-network, deep-learning, neural-networks, tensorflow

In the recent WaveNet paper, the authors refer to their model as having stacked layers of dilated convolutions.
They also produce the following charts, explaining the difference between 'regular' convolutions and dilated convolutions.

The regular convolution looks like this.
Non dilated Convolutions
This is a convolution with a filter size of 2 and a stride of 1, repeated for 4 layers.

They then show an architecture used by their model, which they refer to as dilated convolutions. It looks like this.
WaveNet Dilated Convolutions
They say that each layer has increasing dilations of (1, 2, 4, 8). But to me this looks like a regular convolution with a filter size of 2 and a stride of 2, repeated for 4 layers.

As I understand it, a dilated convolution, with a filter size of 2, stride of 1, and increasing dilations of (1, 2, 4, 8), would look like this.
Actual Dilated Convolution

In the WaveNet diagram, none of the filters skip over an available input. There are no holes. In my diagram, each filter skips over (d – 1) available inputs. This is how dilation is supposed to work, no?
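
For concreteness, here is a minimal NumPy sketch of the distinction I mean (the `conv1d` helper is my own illustration, not the WaveNet code): with a filter of size 2, a stride of 2 halves the output length without skipping any inputs, while a dilation of 2 with stride 1 skips one available input between taps and keeps the output nearly as long as the input.

```python
import numpy as np

def conv1d(x, w, stride=1, dilation=1):
    # Toy causal 1-D convolution: the output at position t reads
    # x[t], x[t - dilation], ..., so a dilation of d skips (d - 1)
    # available inputs between filter taps.
    k = len(w)
    span = dilation * (k - 1) + 1                  # inputs covered by one filter
    return np.array([
        sum(w[i] * x[t - dilation * (k - 1 - i)] for i in range(k))
        for t in range(span - 1, len(x), stride)
    ])

x = np.arange(16, dtype=float)
w = np.array([1.0, 1.0])                           # filter size 2

print(len(conv1d(x, w, stride=2, dilation=1)))     # 8  -> strided: output halves
print(len(conv1d(x, w, stride=1, dilation=2)))     # 14 -> dilated: output stays (almost) full length
```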

So my question is, which (if any) of the following propositions are correct?

  1. I don't understand dilated and/or regular convolutions.
  2. Deepmind did not actually implement a dilated convolution, but rather a strided convolution, and misused the word 'dilation'.
  3. Deepmind did implement a dilated convolution, but did not implement the chart correctly.

I am not fluent enough in TensorFlow to understand exactly what their code is doing, but I did post a related question on Stack Exchange, which contains the bit of code that could answer this question.

Best Answer

From the WaveNet paper:

"A dilated convolution (also called a trous, or convolution with 
holes) is a convolution where the filter is applied over an area larger 
than its length by skipping input values with a certain step. It is 
equivalent to a convolution with a larger filter derived from the 
original filter by dilating it with zeros, but is significantly more 
efficient. A dilated convolution  effectively allows the network to 
operate on a coarser scale than with a normal convolution. This is 
similar to pooling or strided  convolutions, but 
here the output has the same size as the input. As a special case, 
dilated convolution with dilation 1 yields the standard convolution. 
Fig. 3 depicts dilated causal convolutions for dilations 1, 2, 4, and 
8."

The animation shows a fixed stride of one and the dilation factor increasing on each layer.
Animated Fig. 3 from Google's WaveNet blog post
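
As a sketch of that stacking pattern in TensorFlow (the 32 channels and the 16000-sample input are illustrative choices of mine, not values from the WaveNet code), four causal convolution layers with kernel size 2, stride 1, and dilation rates 1, 2, 4, 8 all keep the output as long as the input, while the receptive field grows to 1 + (1 + 2 + 4 + 8) = 16 samples:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(16000, 1))   # e.g. one second of 16 kHz audio
x = inputs
for rate in (1, 2, 4, 8):
    x = tf.keras.layers.Conv1D(
        filters=32,
        kernel_size=2,
        strides=1,               # the stride stays fixed at 1 ...
        dilation_rate=rate,      # ... only the dilation grows per layer
        padding='causal',        # left-pad so the output length matches the input
    )(x)

model = tf.keras.Model(inputs, x)
model.summary()                  # every layer's output keeps length 16000
```

With ordinary stride-2 convolutions instead, each layer would halve the time axis, which is exactly the difference the question is asking about.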