Solved – How does Krizhevsky’s ’12 CNN get 253,440 neurons in the first layer

conv-neural-network, deep learning, neural networks

In Alex Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", the authors enumerate the number of neurons in each layer (see the quote and diagram below).

The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.

[Figure: a 3D view of the CNN architecture from the paper]

The number of neurons for every layer after the first is clear. One simple way to calculate each count is to multiply the three dimensions of that layer (planes x width x height), then double it for the two GPU halves (see the sketch after this list):

  • Layer 2: 27x27x128 * 2 = 186,624
  • Layer 3: 13x13x192 * 2 = 64,896
  • etc.
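
As a quick sanity check, here is the same arithmetic in Python (a minimal sketch; the Layer 4 and Layer 5 dimensions are read off the paper's figure):

# neurons = height * width * planes, times 2 for the two GPU halves
print(224 * 224 * 3)      # 150528 -- the input dimensionality
print(27 * 27 * 128 * 2)  # 186624 -- Layer 2
print(13 * 13 * 192 * 2)  # 64896  -- Layer 3
print(13 * 13 * 192 * 2)  # 64896  -- Layer 4
print(13 * 13 * 128 * 2)  # 43264  -- Layer 5

Every one of those matches the sequence quoted from the paper.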

However, looking at the first layer:

  • Layer 1: 55x55x48 * 2 = 290,400

Notice that this is not 253,440 as specified in the paper!

Calculate Output Size

Another way to calculate the output size of a convolution is:

If the input image is a 3D tensor nInputPlane x height x width, the output image size will be nOutputPlane x owidth x oheight, where

owidth = (width - kW) / dW + 1

oheight = (height - kH) / dH + 1

(from Torch SpatialConvolution Documentation)
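
In Python, that formula is easy to wrap in a small helper (a minimal sketch; the pad parameter is my own addition, not part of the quoted Torch formula, and is only there for the padding guess at the end):

def conv_output_size(size, kernel, stride, pad=0):
    # floor((size + pad - kernel) / stride) + 1; with pad=0 this is the
    # Torch formula quoted above. Here pad counts the total number of
    # extra zero pixels added along the dimension.
    return (size + pad - kernel) // stride + 1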

The input image is:

  • nInputPlane = 3
  • height = 224
  • width = 224

And the convolution layer is:

  • nOutputPlane = 96
  • kW = 11
  • kH = 11
  • dW = 4
  • dH = 4

(i.e. a kernel size of 11 and a stride of 4)

Plugging in those numbers we get:

owidth = floor((224 - 11) / 4 + 1) = floor(54.25) = 54
oheight = floor((224 - 11) / 4 + 1) = floor(54.25) = 54

(The fractional result has to be truncated, since an output size must be an integer.)

So we're one short of the 55x55 dimensions we need to match the paper. They might be using padding (but the cuda-convnet2 model explicitly sets the padding to 0).

If we take the 54-size dimensions we get 96x54x54 = 279,936 neurons – still too many.
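
Restating the mismatch with the conv_output_size helper above:

side = conv_output_size(224, 11, 4)
print(side)              # 54
print(96 * side * side)  # 279936 -- not 253,440
print(96 * 55 * 55)      # 290400 -- not 253,440 either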

So my question is this:

How do they get 253,440 neurons for the first convolutional layer? What am I missing?

Best Answer

From the Stanford CS231n notes on convolutional neural networks:

Real-world example. The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.

ref: http://cs231n.github.io/convolutional-networks/

These notes accompany the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition. For questions, concerns, or bug reports, contact Justin Johnson regarding the assignments, or Andrej Karpathy regarding the course notes.
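
The note's arithmetic checks out with the conv_output_size helper from earlier. Both the 227x227 reading and the padding guess give a 55x55 output (treating "zero-padding of 3 extra pixels" as 3 extra pixels per spatial dimension is my reading of the quote):

print(conv_output_size(227, 11, 4))         # 55
print(conv_output_size(224, 11, 4, pad=3))  # 55 -- 224 + 3 = 227
print(96 * 55 * 55)                         # 290400 neurons in the [55x55x96] volume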