Solved – How does Caffe handle non-integer convolution layer output size

conv-neural-network, deep learning, machine learning

I am studying a Caffe project in which the input image is 400 by 400 pixels and the first layer is a convolution with kernel_size: 11 and stride: 4. By my calculation, the output size is ((400 - 11) / 4) + 1 = 98.25, which is not an integer. What would the output size be in this case? The prototxt with these values follows:

    name: "RP"
    input: "data"
    input_dim: 32
    input_dim: 3
    input_dim: 400
    input_dim: 400
    layers {
      bottom: "data"
      top: "conv1"
      name: "conv1"
      type: CONVOLUTION
      convolution_param {
        num_output: 64
        kernel_size: 11
        stride: 4
        weight_filler {
          type: "xavier"
        }
        bias_filler {
          type: "constant"
          value: 0.1
        }
      }
    }

Best Answer

It should be floor((input + 2*pad - filter) / stride) + 1, which in your case (no pad is set, so pad = 0) is floor((400 - 11) / 4) + 1 = floor(97.25) + 1 = 98.
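As a quick check of the arithmetic, here is a minimal Python sketch of that formula. The function name conv_output_size and its parameter names are made up for illustration and are not part of Caffe's API.

```python
def conv_output_size(input_size, kernel_size, stride=1, pad=0):
    """Return floor((input + 2*pad - kernel) / stride) + 1 using integer arithmetic."""
    span = input_size + 2 * pad - kernel_size
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    # For a non-negative span and a positive stride, // is exactly the floor.
    return span // stride + 1


# The layer from the question: 400 pixel input, kernel_size 11, stride 4, no padding.
print(conv_output_size(400, 11, stride=4))  # 98
```

The fractional 0.25 corresponds to the last pixel of each row and column, which no filter position reaches and which is therefore ignored.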

ref: caffe source code
also see this answer

Related Solutions

Solved – How does Krizhevsky’s ’12 CNN get 253,440 neurons in the first layer

From the Stanford notes on neural networks:

Real-world example. The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.

ref: http://cs231n.github.io/convolutional-networks/

These notes accompany the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition. For questions, concerns, or bug reports, contact Justin Johnson regarding the assignments, or Andrej Karpathy regarding the course notes.
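The numbers in the quoted passage can be reproduced with exact rational arithmetic. This is only a sketch: the helper unfloored_output_size is a hypothetical name, and it deliberately skips the floor so that a non-integer result stays visible.

```python
from fractions import Fraction


def unfloored_output_size(input_size, kernel_size, stride, pad=0):
    """Return (input + 2*pad - kernel) / stride + 1 as an exact fraction, without flooring."""
    return Fraction(input_size + 2 * pad - kernel_size, stride) + 1


size_227 = unfloored_output_size(227, 11, 4)
size_224 = unfloored_output_size(224, 11, 4)

print(size_227)                # 55
print(size_224)                # 217/4, i.e. 54.25, not an integer
print(int(size_227) ** 2 * 96) # 290400 neurons in a [55x55x96] volume
```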

Solved – Fractional output dimensions of “sliding-windows” (convolutions, pooling etc) in neural networks

The fraction part comes from the stride operation. Without stride, the output size should be output_no_stride = input + 2*pad - filter + 1 = 224. With stride, the conventional formula to use is output_with_stride = floor((input + 2*pad - filter) / stride) + 1 = 112.

In many programming languages, integer division rounds toward zero by default, so the floor operation can be omitted when the numerator and denominator are positive integers. (Ref: Caffe's convolution implementation, cuDNN docs)
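To illustrate the rounding point, the sketch below emulates round-toward-zero division (the behavior of integer division in C and C++) and compares it with Python's // operator, which always floors. The helper trunc_div is a made-up name for this example.

```python
def trunc_div(a, b):
    """Integer division that rounds toward zero, emulating C and C++ semantics."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q


# With a non-negative numerator and a positive denominator, truncation and floor agree.
agree = all(trunc_div(a, b) == a // b for a in range(0, 500) for b in range(1, 12))
print(agree)             # True
print(trunc_div(389, 4)) # 97, same as floor(97.25)

# They differ only when the exact quotient is negative and not an integer.
print(trunc_div(-7, 2))  # -3 (toward zero)
print(-7 // 2)           # -4 (floor)
```

In the output-size formula the numerator input + 2*pad - filter is non-negative for any valid layer, so the two conventions give the same result.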

Comparing the output dimension with and without stride

output_with_stride = floor((input + 2*pad - filter) / stride) + 1
                   = floor((output_no_stride - 1) / stride) + 1
                   = ceil(output_no_stride / stride)
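The last step relies on the identity floor((n - 1) / s) + 1 = ceil(n / s) for positive integers n and s. A brute-force check over a small range, as a sketch with hypothetical helper names:

```python
def strided_size(unstrided_size, stride):
    """floor((output_no_stride - 1) / stride) + 1, in integer arithmetic."""
    return (unstrided_size - 1) // stride + 1


def ceil_div(n, d):
    """ceil(n / d) for positive integers, without going through floats."""
    return -(-n // d)


mismatches = [
    (n, s)
    for n in range(1, 300)
    for s in range(1, 20)
    if strided_size(n, s) != ceil_div(n, s)
]
print(mismatches)            # []

# The 224 -> 112 case above is consistent with a stride of 2 (assumed here).
print(strided_size(224, 2))  # 112
```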

Caffe's pooling is a bit more complicated: it first replaces the floor with a ceiling, then, when padding is used, decreases the size by one if the last pooling window does not start strictly inside the image, as shown in the code:

  pooled_height_ = static_cast<int>(ceil(static_cast<float>(
      height_ + 2 * pad_h_ - kernel_h_) / stride_h_)) + 1;
  pooled_width_ = static_cast<int>(ceil(static_cast<float>(
      width_ + 2 * pad_w_ - kernel_w_) / stride_w_)) + 1;
  if (pad_h_ || pad_w_) {
    // If we have padding, ensure that the last pooling starts strictly
    // inside the image (instead of at the padding); otherwise clip the last.
    if ((pooled_height_ - 1) * stride_h_ >= height_ + pad_h_) {
      --pooled_height_;
    }
    if ((pooled_width_ - 1) * stride_w_ >= width_ + pad_w_) {
      --pooled_width_;
    }
    CHECK_LT((pooled_height_ - 1) * stride_h_, height_ + pad_h_);
    CHECK_LT((pooled_width_ - 1) * stride_w_, width_ + pad_w_);
  }

I think the result is mostly aligned with the conventional formula except when the last pooling is entirely outside the original input.
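To experiment with this rule, here is a rough Python port of the snippet above next to the conventional floor formula. It is a sketch, not Caffe code: the function names are invented, and Python's double-precision division stands in for the static_cast<float> division, which is assumed to give the same result for realistic image sizes.

```python
import math


def caffe_pool_output_shape(height, width, kernel_h, kernel_w,
                            stride_h, stride_w, pad_h=0, pad_w=0):
    """Port of the quoted pooling shape logic: ceil instead of floor, then clip."""
    pooled_h = math.ceil((height + 2 * pad_h - kernel_h) / stride_h) + 1
    pooled_w = math.ceil((width + 2 * pad_w - kernel_w) / stride_w) + 1
    if pad_h or pad_w:
        # Drop the last window if it would start in the padding, not the image.
        if (pooled_h - 1) * stride_h >= height + pad_h:
            pooled_h -= 1
        if (pooled_w - 1) * stride_w >= width + pad_w:
            pooled_w -= 1
    return pooled_h, pooled_w


def floor_output_size(size, kernel, stride, pad=0):
    """Conventional formula: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1


# Divisible case: both rules agree.
print(caffe_pool_output_shape(8, 8, 2, 2, 2, 2), floor_output_size(8, 2, 2))  # (4, 4) 4
# Non-divisible case, no padding: the ceiling yields one extra, partial window.
print(caffe_pool_output_shape(400, 400, 11, 11, 4, 4), floor_output_size(400, 11, 4))  # (99, 99) 98
# Padded case where the clip fires and the result matches the floor formula.
print(caffe_pool_output_shape(3, 3, 2, 2, 2, 2, 1, 1), floor_output_size(3, 2, 2, pad=1))  # (2, 2) 2
```

The two rules agree whenever input + 2*pad - kernel is divisible by the stride; otherwise the ceiling can add one window that only partly overlaps the input, unless the clipping step removes it.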
