As Yoshua Bengio, head of the Montreal Institute for Learning Algorithms, remarks:
"Very simple. Just keep adding layers until the test error does not improve anymore."
A method recommended by Geoff Hinton is to add layers until you start to overfit your training set. Then you add dropout or another regularization method.
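Both pieces of advice boil down to a simple loop: grow the network one layer at a time and stop when the extra layer no longer helps. Here is a minimal sketch of that loop. Note the assumptions: `validation_error` is a hypothetical callback that you would write yourself (train a network with `n` hidden layers, return its error on held-out data), `min_improvement` is an arbitrary threshold, and I use a held-out validation set rather than the test set so the test set stays untouched.

```python
from typing import Callable


def choose_depth(validation_error: Callable[[int], float],
                 max_layers: int = 10,
                 min_improvement: float = 1e-3) -> int:
    """Return the last depth at which adding a layer still helped.

    validation_error(n) is a hypothetical callback: it should train a
    network with n hidden layers and return its error on held-out data.
    """
    best_depth = 1
    best_error = validation_error(1)
    for depth in range(2, max_layers + 1):
        error = validation_error(depth)
        if best_error - error < min_improvement:
            break  # the extra layer did not improve held-out error
        best_depth, best_error = depth, error
    return best_depth


# Toy stand-in for real training: error falls until depth 4, then rises.
toy_errors = {1: 0.20, 2: 0.12, 3: 0.08, 4: 0.07, 5: 0.075, 6: 0.09}
print(choose_depth(lambda n: toy_errors[n], max_layers=6))
```

With the toy numbers the loop stops at the fifth layer and reports a depth of 4. If the training error keeps falling while the held-out error rises, that is the overfitting point where dropout or another regularizer comes in.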
The short answer to this is that if you train a neural network to recognize MNIST images, then it will work well for precisely that (i.e., recognize more MNIST images).
Now, it seems a little disappointing that we cannot use our awesome neural network, which probably has an accuracy near 95% or even 99%, to recognize new handwritten digits. Who wants to be able to recognize only images in the MNIST dataset?
So, after researching a little bit more, the general consensus seems to be that if you want your neural network to be able to recognize your own handwritten digits, you have two options:
1) include enough of your own digits into the training set, along with the MNIST ones.
or
2) pre-process your own handwritten digits so they resemble MNIST digits closely enough, thus fitting what the neural network has been trained for.
We could even generalize this not only for MNIST but for any domain. Either you include enough variations in your training set (enrich your training set), or you choose one data representation and pre-process any other variation to match the one you chose (pre-process your input).
I found this article precisely about how to do pre-processing to match MNIST data representation.
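To give an idea of what such pre-processing involves, here is a minimal NumPy sketch. It follows the published description of how MNIST digits were prepared (digit scaled to fit a 20x20 box with its aspect ratio kept, then centred by centre of mass in a 28x28 image). It is not taken from the article above. The function name `to_mnist_layout` is my own, the resampling is plain nearest-neighbour, and the input is assumed to be a 2-D grayscale array with bright ink on a zero background.

```python
import numpy as np


def to_mnist_layout(img: np.ndarray) -> np.ndarray:
    """Map a grayscale digit (bright ink, zero background) to 28x28."""
    img = img.astype(float)
    rows = np.flatnonzero(img.max(axis=1) > 0)
    cols = np.flatnonzero(img.max(axis=0) > 0)
    canvas = np.zeros((28, 28))
    if rows.size == 0:
        return canvas  # blank input: nothing to place
    # Crop to the bounding box of the ink.
    digit = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

    # Scale the longer side to 20 pixels (nearest-neighbour resampling).
    h, w = digit.shape
    scale = 20.0 / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    r_idx = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    c_idx = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    digit = digit[np.ix_(r_idx, c_idx)]

    # Find the centre of mass (fall back to the geometric centre).
    total = digit.sum()
    if total > 0:
        cy = (np.arange(new_h) @ digit.sum(axis=1)) / total
        cx = (np.arange(new_w) @ digit.sum(axis=0)) / total
    else:
        cy, cx = (new_h - 1) / 2, (new_w - 1) / 2

    # Shift so the centre of mass lands on the canvas centre (13.5, 13.5).
    top = min(max(int(round(13.5 - cy)), 0), 28 - new_h)
    left = min(max(int(round(13.5 - cx)), 0), 28 - new_w)
    canvas[top:top + new_h, left:left + new_w] = digit
    return canvas
```

A real pipeline would also need to invert dark-on-light scans, remove noise, and rescale intensities to whatever range the network was trained on. Those steps are left out here.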
I personally cannot help but feel that pre-processing is kind of "cheating", because the whole point of neural networks is not having to resort to this kind of "algorithmic" solution. It would be nicer to train a neural network that generalizes well enough for any data representation (within the limits of a particular domain, like handwritten digits) to work. But these seem to be the current state-of-the-art approaches, and for some domains you'll have to go one way or the other to get good results.
EDIT: as pointed out by a Stack Overflow user in the comments section of this question, this can be seen as a "Domain Adaptation" problem. I'm currently learning about Domain Adaptation techniques, in particular "Domain-Adversarial training of neural networks" (google this for several papers on the subject). Specifically, I'm looking at this GitHub repo that implements a Domain-Adversarial example using TensorFlow. I added this so others can be pointed in the same direction.
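As far as I understand the domain-adversarial papers so far, the central trick is a "gradient reversal" layer placed between the feature extractor and a domain classifier. It passes features through unchanged on the forward pass and flips the sign of the gradient on the backward pass, so the feature extractor is pushed towards features the domain classifier cannot tell apart. The following is only a framework-free sketch of that one layer. The class and argument names are my own, and it is not the code from the repo mentioned above.

```python
import numpy as np


class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam  # strength of the reversal (illustrative default)

    def forward(self, features: np.ndarray) -> np.ndarray:
        # Features reach the domain classifier unchanged.
        return features

    def backward(self, grad_from_domain_classifier: np.ndarray) -> np.ndarray:
        # The feature extractor receives the negated, scaled gradient.
        return -self.lam * grad_from_domain_classifier


grl = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0, 3.0])
grad = np.array([0.2, 0.4, -0.6])
print(grl.forward(features), grl.backward(grad))
```

In a real framework this would be written as a custom-gradient operation rather than a hand-rolled class.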
Best Answer
A major insight into how a neural network can learn to classify something as complex as image data, given just examples and correct answers, came to me while studying the work of Professor Kunihiko Fukushima on the neocognitron in the 1980s. Instead of just showing his network a bunch of images and using back-propagation to let it figure things out on its own, he took a different approach and trained his network layer by layer, and even node by node. He analyzed the performance and operation of each individual node of the network and intentionally modified those parts to make them respond in intended ways.
For instance, he knew he wanted the network to be able to recognize lines, so he trained specific layers and nodes to recognize three-pixel horizontal lines, three-pixel vertical lines, and specific variations of diagonal lines at all angles. By doing this, he knew exactly which parts of the network could be counted on to fire when the desired patterns existed. Then, since each layer is highly connected, the neocognitron as a whole could identify each of the composite parts present in the image no matter where they physically existed. So when a specific line segment existed somewhere in the image, there would always be a specific node that would fire.
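To make the "a node fires wherever the line is" idea concrete, here is a small NumPy sketch. It is not Fukushima's training procedure: the 3x3 kernel is written by hand as a stand-in for a node that has already learned to respond to three-pixel horizontal lines, and the names `detect`, `fires`, and the threshold value are my own choices.

```python
import numpy as np


def detect(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image (valid cross-correlation)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out


# Hand-written stand-in for a node tuned to 3-pixel horizontal lines.
horizontal = np.array([[-1, -1, -1],
                       [ 2,  2,  2],
                       [-1, -1, -1]], dtype=float)


def fires(image: np.ndarray, kernel: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """Boolean map of positions where the detector responds."""
    return detect(image, kernel) >= threshold


top_left = np.zeros((8, 8))
top_left[2, 1:4] = 1.0        # horizontal line near the top left
bottom_right = np.zeros((8, 8))
bottom_right[5, 4:7] = 1.0    # same line, different position
vertical = np.zeros((8, 8))
vertical[1:4, 3] = 1.0        # vertical line: should not trigger

print(np.argwhere(fires(top_left, horizontal)))
print(np.argwhere(fires(bottom_right, horizontal)))
print(fires(vertical, horizontal).any())
```

The same detector responds at whichever position the horizontal segment appears and stays silent for the vertical one, which is the position-independent behaviour described above.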
With this picture in mind, consider linear regression, which simply finds the formula (a line) that passes most closely through your data by minimizing the sum of squared errors. That is easy enough to understand. To fit curved "lines" we can do the same sum-of-products calculation, except that we now add terms in x^2, x^3, or even higher powers. Pass that same weighted sum through a sigmoid and threshold the result, and you have a logistic regression classifier. With polynomial features, this classifier can find relationships that are not linear in nature. In fact, given enough such features, logistic regression can express arbitrarily complex relationships, but you still need to choose the correct number of power features by hand to do a good job of predicting the data.
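The "add power features by hand" step can be sketched with NumPy's least-squares solver. The data below are made up for illustration (a noiseless cubic), and `fit_polynomial` and `predict` are names I chose. The point is only that the fitting procedure is the same for a line and a curve. What changes is the set of features you decide to supply.

```python
import numpy as np


def fit_polynomial(x: np.ndarray, y: np.ndarray, degree: int) -> np.ndarray:
    """Least-squares fit of y on the features 1, x, x^2, ..., x^degree."""
    X = np.vander(x, degree + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs


def predict(coeffs: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Evaluate the fitted polynomial at the points x."""
    return np.vander(x, len(coeffs), increasing=True) @ coeffs


# Illustrative data: y = 1 - 2x + 0.5x^3, no noise.
x = np.linspace(-2, 2, 41)
y = 1.0 - 2.0 * x + 0.5 * x**3

line = fit_polynomial(x, y, degree=1)    # features: 1, x
cubic = fit_polynomial(x, y, degree=3)   # features: 1, x, x^2, x^3

line_error = np.sum((predict(line, x) - y) ** 2)
cubic_error = np.sum((predict(cubic, x) - y) ** 2)
print(line_error, cubic_error)
```

The straight line leaves a large squared error while the cubic fit recovers the curve, but only because the right degree was picked by hand.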
One way to think of a neural network is to consider the last layer as a logistic regression classifier, and the hidden layers as automatic "feature selectors". This eliminates the work of manually choosing the correct number and power of the input features. Thus, the NN becomes an automatic power feature selector and can find any linear or non-linear relationship, or serve as a classifier of arbitrarily complex sets (this assumes only that there are enough hidden layers and connections to represent the complexity of the model it needs to learn). In the end, a well-functioning NN is expected to learn not just "the relationship" between the inputs and outputs; what we strive for is an abstraction, or a model, that generalizes well.
As a rule of thumb, a neural network cannot learn anything that a reasonably intelligent human could not theoretically learn from the same data given enough time, however,