Solved – How to handle even and odd convolutional filter sizes and images


Is there a rule of thumb for determining the size of a convolutional filter given the shape of the input? Specifically, if you want to do a 1D convolution over an even-length vector, does the kernel need to be a divisor of the vector length? Does the kernel need to be even? I understand that when using an odd-length kernel, you align the center element with the stride position on the input vector, but with even-length kernels there is no center position. For example, if I have a vector [a,b,c,d] and a kernel [0,1], should the kernel start by aligning 1 with a or should 0 align with a?
Also is there any theory on why and how much padding and stride you should use when convolving and trade-offs between these two parameters?

Best Answer

It's usual to use stride 1 and to pad the input so that each layer's output is the same size as its input. Using a stride of 1 separates feature extraction (the job of the convolutional layers) from making the model invariant to spatial translation (the job of the pooling layers). Padding the input ensures that the model doesn't gradually lose information about edge elements.
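As a rough illustration of size-preserving padding with stride 1, here is a minimal NumPy sketch for the 1D case. The helper name `same_pad_1d` is hypothetical, and the rule for even-length kernels (putting the extra padding element on the trailing side) is only one common convention that is assumed here; libraries may differ. Under that assumption, the kernel [0,1] from the question starts with 0 aligned to a.

```python
import numpy as np

def same_pad_1d(x, kernel):
    """Stride-1 cross-correlation whose output has the same length as x.

    Assumption: for an even-length kernel the single extra padding element
    goes on the trailing (right) side. This is a convention, not a rule.
    """
    x = np.asarray(x)
    kernel = np.asarray(kernel)
    k = len(kernel)
    total = k - 1              # total padding needed to keep the length
    left = total // 2          # odd k: symmetric; even k: one less on the left
    right = total - left
    padded = np.pad(x, (left, right))  # zero padding
    # Slide the kernel without flipping it (cross-correlation, as in CNNs).
    return np.array([np.dot(padded[i:i + k], kernel) for i in range(len(x))])

print(same_pad_1d([1, 2, 3, 4], [0, 1]))     # [2 3 4 0]
print(same_pad_1d([1, 2, 3, 4], [1, 2, 3]))  # [ 8 14 20 11]
```

With the even kernel, the first window covers the first two input elements, so nothing forces the kernel length to divide the input length; only the padding split changes.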

Related Solutions

Solved – CNN filter sizes and padding

Many recent effective CNN structures use small filters that preserve the spatial resolution, for example the VGG network and the 100-layer residual network.

I think the most important point is that keeping the input and output the same size lets us simply stack more layers without decreasing the spatial resolution, so that we can build deeper networks.

Moreover, with such spatial consistency we can add an operation between the input and output of a set of layers, as in the residual network,

Formally, denoting the desired underlying mapping as $H(x)$, we let the stacked nonlinear layers fit another mapping of $F(x) := H(x)−x$.

or concatenate the output from filters of different sizes, as in the inception network.
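To make the shape argument concrete, here is a small 1D sketch in NumPy. The function names (`conv_same`, `residual_block`, `inception_like`) are hypothetical and the single conv + ReLU layer is only a stand-in for the stacked layers; it is not how either network is actually implemented. The point is that the addition and the stacking only work because every branch keeps the input length.

```python
import numpy as np

def conv_same(x, kernel):
    # mode="same" returns an output of length max(len(x), len(kernel)),
    # which equals len(x) here. Note that np.convolve flips the kernel
    # (true convolution), which does not matter for the shape argument.
    return np.convolve(x, kernel, mode="same")

def residual_block(x, kernel):
    # F(x): one conv + ReLU standing in for the stacked nonlinear layers.
    f = np.maximum(conv_same(x, kernel), 0.0)
    # H(x) = F(x) + x is only defined because F(x) and x have the same shape.
    return f + x

def inception_like(x, kernels):
    # Outputs of filters with different sizes can be stacked along a new
    # channel axis because each one preserves the input length.
    return np.stack([conv_same(x, k) for k in kernels])

x = np.arange(8.0)
print(residual_block(x, np.array([0.25, 0.5, 0.25])).shape)          # (8,)
print(inception_like(x, [np.ones(1), np.ones(3), np.ones(5)]).shape)  # (3, 8)
```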

Solved – Fractional output dimensions of “sliding-windows” (convolutions, pooling etc) in neural networks

The fraction part comes from the stride operation. Without stride, the output size should be output_no_stride = input + 2*pad - filter + 1 = 224. With stride, the conventional formula to use is output_with_stride = floor((input + 2*pad - filter) / stride) + 1 = 112.

In many programming languages, integer division rounds toward zero by default, so the explicit floor operation can be omitted when the numerator and denominator are positive integers. (Ref: Caffe's convolution implementation, cuDNN docs)
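The formula can be written as a small helper. This is a sketch: the function name `conv_output_size` is made up, and the values input=224, filter=3, pad=1, stride=2 are assumed because they reproduce the 224 and 112 quoted above; they are not given in the text. Python's `//` is floor division, which agrees with round-toward-zero whenever the numerator is non-negative.

```python
def conv_output_size(input_size, filter_size, pad, stride=1):
    """Output length of a convolution along one dimension.

    Uses floor((input + 2*pad - filter) / stride) + 1. Python's // floors,
    which matches C-style truncation for non-negative numerators.
    """
    numerator = input_size + 2 * pad - filter_size
    if numerator < 0:
        raise ValueError("filter is larger than the padded input")
    return numerator // stride + 1

# Assumed example values that reproduce the numbers in the text.
print(conv_output_size(224, 3, pad=1, stride=1))  # 224
print(conv_output_size(224, 3, pad=1, stride=2))  # 112
```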

Comparing the output dimension with and without stride

output_with_stride = floor((input + 2*pad - filter) / stride) + 1
                   = floor((output_no_stride - 1) / stride) + 1
                   = ceil(output_no_stride / stride)
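The last step uses the identity floor((n - 1) / s) + 1 = ceil(n / s) for positive integers n and s. A quick brute-force check (a sanity test over a small range, not a proof; the helper names are hypothetical):

```python
def strided_from_floor(output_no_stride, stride):
    # floor((output_no_stride - 1) / stride) + 1
    return (output_no_stride - 1) // stride + 1

def strided_from_ceil(output_no_stride, stride):
    # ceil(output_no_stride / stride), computed with integers only
    return -(-output_no_stride // stride)

mismatches = [
    (n, s)
    for n in range(1, 301)
    for s in range(1, 11)
    if strided_from_floor(n, s) != strided_from_ceil(n, s)
]
print(mismatches)  # []
```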

Caffe's pooling is a bit more complicated. It first replaces the floor with a ceiling, then decreases the size by one if the last pooling window does not start strictly inside the image, as shown in the code.

  pooled_height_ = static_cast<int>(ceil(static_cast<float>(
      height_ + 2 * pad_h_ - kernel_h_) / stride_h_)) + 1;
  pooled_width_ = static_cast<int>(ceil(static_cast<float>(
      width_ + 2 * pad_w_ - kernel_w_) / stride_w_)) + 1;
  if (pad_h_ || pad_w_) {
    // If we have padding, ensure that the last pooling starts strictly
    // inside the image (instead of at the padding); otherwise clip the last.
    if ((pooled_height_ - 1) * stride_h_ >= height_ + pad_h_) {
      --pooled_height_;
    }
    if ((pooled_width_ - 1) * stride_w_ >= width_ + pad_w_) {
      --pooled_width_;
    }
    CHECK_LT((pooled_height_ - 1) * stride_h_, height_ + pad_h_);
    CHECK_LT((pooled_width_ - 1) * stride_w_, width_ + pad_w_);
  }

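I think the result matches the conventional formula whenever the stride evenly divides input + 2*pad - filter. Otherwise the ceiling adds one extra window that overhangs the edge of the input, unless that window would start in the padding, in which case it is clipped.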
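To compare the two rules on concrete sizes, here is a simplified one-dimensional Python port of the Caffe snippet. It is a sketch: the names `caffe_pool_size` and `conventional_size` are hypothetical, and the port checks only one dimension's padding, whereas the original tests `pad_h_ || pad_w_` together.

```python
import math

def conventional_size(size, kernel, pad, stride):
    # floor((size + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

def caffe_pool_size(size, kernel, pad, stride):
    # Ceiling instead of floor, as in the Caffe snippet.
    pooled = int(math.ceil((size + 2 * pad - kernel) / stride)) + 1
    if pad:
        # Clip the last window if it would start in the trailing padding.
        if (pooled - 1) * stride >= size + pad:
            pooled -= 1
    return pooled

for args in [(224, 2, 0, 2), (224, 3, 0, 2), (4, 2, 1, 3)]:
    print(args, conventional_size(*args), caffe_pool_size(*args))
# (224, 2, 0, 2) 112 112
# (224, 3, 0, 2) 111 112
# (4, 2, 1, 3) 2 2
```

In the second case the stride does not divide 221 evenly, so the ceiling keeps one extra window that overhangs the input by one element. In the third case the extra window would start in the padding and is clipped.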
