Solved – heuristic for determining the size of a fully connected layer at the end of a CNN

artificial-intelligence, conv-neural-network, deep-learning, machine-learning, neural-networks

For example, in VGG/OxfordNet, the fully connected (dense) layers that precede the final classification layer are of size 4096.

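For reference, here is a PyTorch-style sketch of that dense head (the dimensions follow the standard VGG-16 configuration, where a 224×224 input leaves a 512 × 7 × 7 feature map before flattening):

```python
import torch.nn as nn

# The dense head of VGG-16: two 4096-wide hidden layers before the
# 1000-way ImageNet classifier. 512 * 7 * 7 = 25088 is the flattened
# size of the last conv feature map for 224x224 inputs.
classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(),
    nn.Linear(4096, 1000),
)
```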

Similarly, the AlexNet paper states:

… the number of neurons in the network’s remaining layers is given by
253,440–186,624–64,896–64,896–43,264–4096–4096–1000.

My question is: what is the rationale for choosing this particular number of neurons in the dense layers?

I haven't found a satisfactory explanation. Each of the above examples uses a different input resolution and different numbers and sizes of kernels, so each generates a different number of parameters in its first dense layer, yet they all settle on similar dense-layer widths (see the quick computation below).
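To make that asymmetry concrete, here is a back-of-the-envelope computation of the weight count of the first dense layer in each network; the flatten sizes assume the standard 224×224 VGG-16 and 227×227 AlexNet configurations:

```python
# Weight count of the first dense layer = flattened conv output size * layer width.
# Flatten sizes below assume the standard VGG-16 (224x224 input) and
# AlexNet (227x227 input) configurations.
for name, flatten in [("VGG-16", 512 * 7 * 7), ("AlexNet", 256 * 6 * 6)]:
    weights = flatten * 4096
    print(f"{name}: {flatten} -> 4096 gives {weights:,} weights")

# VGG-16:  25088 -> 4096 gives 102,760,448 weights
# AlexNet:  9216 -> 4096 gives  37,748,736 weights
```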

Even if this is purely empirical, I'd be interested in hearing your thoughts on how this is decided.

Best Answer

Choosing a network architecture is a bit of a "black art".

The network authors might have tried several different layer sizes and kept the one that worked best (evaluating each using cross-validation). You can also inform your choice with what the research literature reports to work well on similar tasks, and use that as a starting point for experimentation.

One consideration here is the number of weights that can be set independently: the more of them you have, the more expressive your neural network can be, but the longer training takes and the greater the risk of overfitting. You probably want the number to be as small as possible without sacrificing accuracy. So you might start with a small layer and increase its size until you stop getting improvements in accuracy (measured using cross-validation), as in the sketch below.
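Here is a minimal sketch of that "start small and grow" search, using scikit-learn's MLPClassifier as a stand-in for the dense head; the dataset (load_digits), the candidate widths, and the improvement threshold are illustrative assumptions, not a prescription:

```python
# Sketch: grow the dense layer width until cross-validated accuracy
# stops improving. Dataset and candidate sizes are illustrative only.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

best_size, best_score = None, -np.inf
for size in [32, 64, 128, 256, 512]:  # double the layer width each step
    clf = MLPClassifier(hidden_layer_sizes=(size,), max_iter=500, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()  # cross-validated accuracy
    print(f"hidden size {size:4d}: CV accuracy {score:.4f}")
    if score > best_score + 1e-3:  # keep growing only while it clearly helps
        best_size, best_score = size, score
    else:
        break  # improvements stopped; keep the smaller layer

print(f"chosen hidden size: {best_size}")
```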