Solved – Basis of LeNet-5 Hyperparameter Selection

conv-neural-network, hyperparameter, machine-learning

I am reading Yann LeCun's seminal paper "Gradient-Based Learning Applied to Document Recognition". Although the paper includes details of the architecture, little is said about the choice of hyperparameters. For example, the hidden fully connected layers have 120 and 84 units respectively, but the reason for this choice is not given. Similarly, the first convolutional layer has an output depth of 6. Are all of these hyperparameters selected through cross-validation, or on the basis of some domain knowledge?

[Figure: LeNet-5 architecture]

I am asking because I am trying to train a similar network for 43 output classes. With about 40,000 training images, I reach 99.9% training accuracy but only 95% validation accuracy. I have been able to narrow the gap between the two by adding dropout layers and tuning the 'keep' probability. I want to push validation accuracy toward human-level performance (~98%), and I'd like to tweak the different layer sizes.
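Roughly, the network I mean looks like the minimal PyTorch sketch below. The layer sizes (6 and 16 filters, 120 and 84 hidden units) follow the paper; the dropout placement and probability, and the use of ReLU and max pooling instead of the paper's original activations and subsampling, are my own illustrative choices (the TensorFlow-style 'keep' probability corresponds to 1 minus PyTorch's drop probability).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet5Variant(nn.Module):
    """LeNet-5-style network with dropout for 43 classes (1x32x32 input).
    Filter counts and hidden sizes follow the paper; dropout probability,
    ReLU, and max pooling are illustrative modern substitutions."""
    def __init__(self, num_classes=43, dropout_p=0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)   # 32x32 -> 6 x 28x28
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)  # 14x14 -> 16 x 10x10
        self.pool = nn.MaxPool2d(2)                   # halves spatial dims
        self.fc1 = nn.Linear(16 * 5 * 5, 120)         # 16*5*5 = 400 -> 120
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        return self.fc3(x)
```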

Best Answer

Choosing the number of layers, the size of each layer, the kernel sizes, the number of filters, and so on is to a large extent an art. That's why there are so many papers that simply present different architectures: VGG and its numerous variants, AlexNet, LeNet-5, Inception, and so on.

Generally speaking, I think the field has converged on a few conventions, such as (see the sketch after this list):

  • a kernel size of 3 is common (Inception being a notable exception)
  • a stride of 1 is common
  • pooling of size 2 (sometimes 3) every few layers
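
As a minimal sketch of that pattern (assuming PyTorch; the channel counts here are arbitrary, illustrative values):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, pool=False):
    """Common modern pattern: 3x3 convolution, stride 1, optional 2x2 max pooling."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# Stack a few blocks, pooling every couple of layers.
features = nn.Sequential(
    conv_block(3, 32),
    conv_block(32, 32, pool=True),
    conv_block(32, 64),
    conv_block(64, 64, pool=True),
)
```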

But beyond that, there is huge scope for variation and trial and error. An architecture that works well on one dataset might work less well on another: the new problem may call for smaller or larger network capacity, the input size may suit the data samples differently, the number of layers may not fit the target, and so on.

LeNet-5 is a pretty good place to start, particularly since your training set is similar in size to MNIST (which has 60,000 training examples). Dropout is good. Grid search over the hyperparameters, as you tentatively suggest, is good.
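A grid search over a few layer sizes and dropout settings can be as simple as the loop below; `train_and_evaluate` is a hypothetical placeholder for your own training routine, and the candidate values are just examples:

```python
import itertools
import random

def train_and_evaluate(fc1_units, fc2_units, dropout_p):
    """Hypothetical stand-in: train the network with these settings and
    return validation accuracy. Replace with your actual training code."""
    return random.random()  # dummy value so the sketch runs end to end

grid = {
    "fc1_units": [84, 120, 200],
    "fc2_units": [64, 84, 120],
    "dropout_p": [0.3, 0.5, 0.7],
}

best_acc, best_cfg = -1.0, None
for fc1, fc2, p in itertools.product(*grid.values()):
    acc = train_and_evaluate(fc1, fc2, p)
    if acc > best_acc:
        best_acc = acc
        best_cfg = {"fc1_units": fc1, "fc2_units": fc2, "dropout_p": p}

print(f"best validation accuracy {best_acc:.3f} with {best_cfg}")
```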

Just be aware that you should reserve a hefty chunk of your data as a test set and evaluate on it only rarely, so that you don't overfit to it, which your hyperparameter search will rapidly do to the validation set.
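One simple way to set that up, assuming your ~40,000 images fit in memory as arrays (the 15% split fractions here are arbitrary, illustrative values):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the ~40,000 32x32 grayscale images and 43-class labels.
X = np.random.rand(40000, 32, 32, 1).astype("float32")
y = np.random.randint(0, 43, size=40000)

# Hold out a test set first; only evaluate on it rarely.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

# Carve a validation set out of the remainder for hyperparameter search.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, stratify=y_trainval, random_state=0)
```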
