Questions About Training Multilayer Perceptrons on MNIST Samples with Complex Structures

According to the MNIST database, an MLP with 784 input neurons, hidden layers of 2500, 2000, 1500, 1000, and 500 neurons, and 10 output neurons can be trained with a learning rate of 0.35% (= 0.0035). The network is trained with stochastic gradient descent (without (mini-)batches) and backpropagation, and the sigmoid function is used as the activation function for every neuron in every layer.
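
For reference, here is a minimal Keras sketch of roughly this configuration (the MSE loss, the single smoke-test epoch, and batch_size=1 standing in for per-sample SGD are my own assumptions; no loss function is specified on the MNIST page):

```python
# Rough Keras sketch of the configuration described above (assumptions noted in the text).
import tensorflow as tf

# Load MNIST and flatten to 784-dimensional vectors in [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# 784-2500-2000-1500-1000-500-10, sigmoid everywhere.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(2500, activation="sigmoid"),
    tf.keras.layers.Dense(2000, activation="sigmoid"),
    tf.keras.layers.Dense(1500, activation="sigmoid"),
    tf.keras.layers.Dense(1000, activation="sigmoid"),
    tf.keras.layers.Dense(500, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="sigmoid"),
])

# Plain SGD with learning rate 0.0035; MSE is an assumption, since no loss is specified.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.0035),
              loss="mse", metrics=["accuracy"])

# batch_size=1 corresponds to per-sample SGD (no (mini-)batches); one epoch as a smoke test.
model.fit(x_train, y_train, batch_size=1, epochs=1, validation_data=(x_test, y_test))
```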

  1. The weights and biases are both initialised randomly within the range of -1 to 1. Is this correct?
  2. How many epochs (1 epoch = all samples trained once) are needed on the 60,000-image dataset to get close to the 99.65% accuracy claimed by Cireşan, Meier, Gambardella and Schmidhuber (2010) (arXiv)?
  3. My self-built neural network does 100 iterations (100 iterations = training 100 samples) in 20 seconds with this network configuration on a computer with 8 GB of RAM, using Java. Is this considered average speed, or is TensorFlow faster, and if so, by what factor?

Best Answer

[...] The network is trained with stochastic gradient descent (without (mini-)batches) and backpropagation, and the sigmoid function is used as the activation function for every neuron in every layer.

You are saying that you want to reproduce the accuracy figure from Ciresan et al. (2010), but you are not using the same protocol as the authors, so you have no guarantee of reproducing it. One discrepancy is that they used a variable learning rate and you don't.
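
If you want to mimic that, a decaying schedule is easy to set up in a framework; here is one illustrative way in Keras (the initial value, decay steps, and decay rate below are placeholders, not the schedule from the paper):

```python
# Illustrative decaying learning rate in Keras; the numbers are placeholders,
# not the schedule used by Ciresan et al. (2010).
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,  # placeholder starting value
    decay_steps=60000,           # e.g. decay once per epoch of 60,000 per-sample updates
    decay_rate=0.97,             # placeholder decay factor
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
```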

  1. The weights and biases are both initialised randomly within the range of -1 to 1. Is this correct?

There's no such thing as a "correct" initialization, but if you want to reproduce the paper, use what they did, i.e. a "uniform random distribution in [−0.05,0.05]".
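
In Keras, for instance, that could be specified roughly as follows (whether the same range applies to the biases is my assumption; the layer size is just an example):

```python
# Uniform initialization in [-0.05, 0.05] for one layer, as an illustration.
import tensorflow as tf

init = tf.keras.initializers.RandomUniform(minval=-0.05, maxval=0.05)
layer = tf.keras.layers.Dense(
    2500,
    activation="sigmoid",
    kernel_initializer=init,  # weights
    bias_initializer=init,    # biases (assumed to use the same range)
)
```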

  2. How many epochs (1 epoch = all samples trained once) are needed on the 60,000-image dataset to get close to the 99.65% accuracy claimed by Cireşan, Meier, Gambardella and Schmidhuber (2010) (arXiv)?

The paper doesn't say how many epochs it took. You can always train with a stopping criterion so that training stops once it hits the desired accuracy. But, as noted above, you are not using exactly the same settings as the authors, so it may train indefinitely.
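
As a sketch, such a stopping criterion could be a small Keras callback along these lines (the 99.65% target and the "accuracy" metric name are assumptions taken from the question, not from the paper):

```python
# Illustrative callback: stop training once a target training accuracy is reached.
# Requires the model to be compiled with metrics=["accuracy"].
import tensorflow as tf

class StopAtAccuracy(tf.keras.callbacks.Callback):
    def __init__(self, target=0.9965):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if logs.get("accuracy", 0.0) >= self.target:
            print(f"Reached {self.target:.2%} training accuracy at epoch {epoch + 1}; stopping.")
            self.model.stop_training = True

# Usage: model.fit(x_train, y_train, epochs=10000, callbacks=[StopAtAccuracy()])
```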

  3. My self-built neural network does 100 iterations (100 iterations = training 100 samples) in 20 seconds with this network configuration on a computer with 8 GB of RAM, using Java. Is this considered average speed, or is TensorFlow faster, and if so, by what factor?

There's no such thing as "average speed": it depends on the implementation, your hardware, the data, etc. If you want to compare against something, the authors of the paper say:

Networks with up to 12 million weights can successfully be trained by plain gradient descent to achieve test errors below 1% after 20-30 epochs in less than 2 hours of training.

They had 50,000 images in the training set, so that's 50,000 (samples) × 25 (epochs) / (120 (minutes) × 60 (seconds per minute)) ≈ 173.6 iterations per second on hardware from 10 years ago. In TensorFlow's documentation, you can find an example of a simple feed-forward network trained on MNIST at a speed of around one second per epoch.
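
That example looks roughly like the following (this is a paraphrase from memory, not a verbatim copy of the tutorial; the layer sizes and hyperparameters are approximate):

```python
# Small feed-forward MNIST model in the spirit of TensorFlow's beginner tutorial
# (approximate reconstruction, not a verbatim copy).
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# With the default batch size of 32, each epoch takes on the order of seconds.
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```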

Don't re-implement such things by hand; use ready-made frameworks like PyTorch or TensorFlow. The frameworks are well tested and optimized. The only two reasons to implement such things yourself are that you are doing it just for fun, as a learning project, or that you have already used the frameworks but they do not work for you for some reason (e.g. you have esoteric technical requirements, or it's a research project that involves something not implemented in the frameworks).
