There are a number of questions to ask:
- do you have the appropriate number of neurons in each layer?
- are you using the appropriate types of transfer functions?
- are you using the appropriate type of learning algorithm?
- do you have a large enough sample size?
- can you confirm that your samples have the right sorts of relationship with each other to be informative (not redundant, of relevant dimension, etc.)?
What can you give in the way of ephemeris? Can you tell us something about the nature of the data?
You could make a gradient boosted tree of neural networks.
You asked what happens if you stop early.
You can try it yourself. Run the training 300 times, each time starting from randomly initialized weights and stopping at a specified number of iterations, let's say 100. At that point compute your ensemble error, your training-subset error, and your test-set error. Repeat. Once you have 300 values of each error, you have an idea of your error distribution given 100 learning iterations. If you like, you can then sample that distribution at several other iteration counts; I suggest 200, 500, and 1000. This will give you an idea of how your SNR changes over time. A plot of SNR vs. iteration count can reveal "cliffs" or "good enough" regions. Sometimes there are cliffs where the error collapses. Sometimes the error is acceptable at that point.
It takes "relatively simple" data or "pretty good" luck for your system to consistently converge in under 100 iterations. Neither of these is about repeatability, and neither generalizes.
Why are you thinking in terms of weights converging and not in terms of the error being below a particular threshold? Have you ever heard of a voting paradox? When you have cyclic interactions in your system (like feedback in neural networks), you can have voting paradoxes: coupled changes. I don't know whether the weights alone are a sufficient indicator of convergence of the network.
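The experiment above can be sketched roughly as follows. This is a minimal illustration, not the author's setup: the toy data (`make_data`), the four-unit tanh network, the learning rate, and the small run counts are all assumptions chosen so that the script finishes quickly in pure Python. It records only training and test error; an ensemble error would need the predictions of all runs averaged.

```python
import math
import random
import statistics

HIDDEN = 4  # hidden units in the toy network (an assumption)

def make_data(n, rng):
    """Toy 1-D regression data: y = sin(3x) + Gaussian noise (hypothetical)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, math.sin(3.0 * x) + rng.gauss(0.0, 0.1)))
    return data

def init_params(rng):
    """Random starting point: weights ~ N(0, 1), output bias 0."""
    return {
        "w1": [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)],
        "b1": [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)],
        "w2": [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)],
        "b2": 0.0,
    }

def forward(p, x):
    """Return hidden activations and the scalar network output."""
    acts = [math.tanh(w * x + b) for w, b in zip(p["w1"], p["b1"])]
    out = sum(w * a for w, a in zip(p["w2"], acts)) + p["b2"]
    return acts, out

def mse(p, data):
    return sum((forward(p, x)[1] - y) ** 2 for x, y in data) / len(data)

def train(p, data, iterations, lr=0.05):
    """Full-batch gradient descent, stopped after a fixed iteration count."""
    n = len(data)
    for _ in range(iterations):
        g_w1 = [0.0] * HIDDEN
        g_b1 = [0.0] * HIDDEN
        g_w2 = [0.0] * HIDDEN
        g_b2 = 0.0
        for x, y in data:
            acts, out = forward(p, x)
            d_out = 2.0 * (out - y) / n  # derivative of the MSE w.r.t. the output
            g_b2 += d_out
            for j in range(HIDDEN):
                g_w2[j] += d_out * acts[j]
                d_hidden = d_out * p["w2"][j] * (1.0 - acts[j] ** 2)
                g_w1[j] += d_hidden * x
                g_b1[j] += d_hidden
        for j in range(HIDDEN):
            p["w1"][j] -= lr * g_w1[j]
            p["b1"][j] -= lr * g_b1[j]
            p["w2"][j] -= lr * g_w2[j]
        p["b2"] -= lr * g_b2
    return p

def error_distribution(iterations, runs, train_set, test_set, seed=0):
    """Train `runs` networks from random starts; collect errors at the stop."""
    rng = random.Random(seed)
    train_errs, test_errs = [], []
    for _ in range(runs):
        p = train(init_params(rng), train_set, iterations)
        train_errs.append(mse(p, train_set))
        test_errs.append(mse(p, test_set))
    return train_errs, test_errs

data_rng = random.Random(42)
train_set = make_data(30, data_rng)
test_set = make_data(30, data_rng)
for iters in (25, 50, 100):  # scale up to 100, 200, 500, 1000 on real data
    tr, te = error_distribution(iters, runs=20, train_set=train_set, test_set=test_set)
    print(iters, "train mean/sd:", round(statistics.mean(tr), 4),
          round(statistics.stdev(tr), 4), "test mean:", round(statistics.mean(te), 4))
```

Plotting the mean and spread of these errors against the iteration count is one way to look for the "cliffs" mentioned above.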
You can think of the weights as a space. It has more than 3 dimensions, but it is still a space. In the "centroid" of that space is your "best fit" region. Far from the centroid is a less good fit. You can think of the current setting of your weights as a single point in that space.
Now you don't know where the "good" actually is. What you do have is a local "slope". You can perform gradient descent toward local "better" given where your point is right now. It doesn't tell you the "universal" better, but local is better than nothing.
So you start iterating, walking downhill toward that valley of betterness. You iterate until you think you are done. Maybe the values of your weights are large. Maybe they are bouncing all over the place. Maybe the compute is "taking too long". You want to be done.
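As a toy picture of that walk, here is a minimal sketch in which the "best fit" point is known in advance, which is never the case for a real network. The quadratic `loss`, the `target` vector, the step size, and the `good_enough` threshold are all illustrative assumptions. The loop only ever uses the local slope, and it stops either when the error is below the threshold or when the iteration budget runs out.

```python
def loss(w, target):
    """Squared distance from the current weights to the best-fit point."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def gradient(w, target):
    """Local slope of the loss at w."""
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]

def descend(w, target, lr=0.1, max_iter=1000, good_enough=1e-6):
    """Walk downhill until the error is small enough or the budget is spent."""
    for step in range(max_iter):
        if loss(w, target) < good_enough:
            return w, step
        g = gradient(w, target)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w, max_iter

target = [1.0, 1.0, 1.0, 1.0]  # hypothetical best-fit weights
w, steps = descend([5.0, -3.0, 2.0, 0.5], target)
print(steps, [round(wi, 4) for wi in w])
```

Note that the stopping rule here looks at the error, not at whether the weights have stopped moving.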
So how do you know whether where you are is "good enough"?
Here is a quick test that you could do:
Take 30 uniform random subsets of the data (like a few percent of the data each) and retrain the network on them. It should be much faster. Observe how long it takes them to converge and compare it with the convergence history of the big set. Test the error of the network for the entire data on these subsets and see how that distribution of errors compares to your big error. Now bump the subset sizes up to maybe 5% of your data and repeat. See what this teaches you.
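A rough sketch of this subset test is shown below. To keep it short and fast, a two-parameter linear model trained by gradient descent stands in for the network; that substitution, the synthetic data, and the names `fit_gd` and `subset_experiment` are assumptions made for illustration only. For each subset it records how many iterations the fit needed and how well the resulting model does on the entire data set.

```python
import random
import statistics

def mse(a, b, data):
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

def fit_gd(data, lr=0.1, tol=1e-8, max_iter=5000):
    """Fit y ~ a*x + b by gradient descent; stop when the error stalls."""
    a = b = 0.0
    n = len(data)
    prev = mse(a, b, data)
    for it in range(1, max_iter + 1):
        ga = sum(2.0 * (a * x + b - y) * x for x, y in data) / n
        gb = sum(2.0 * (a * x + b - y) for x, y in data) / n
        a, b = a - lr * ga, b - lr * gb
        err = mse(a, b, data)
        if abs(prev - err) < tol:
            return a, b, it
        prev = err
    return a, b, max_iter

def subset_experiment(data, fraction, n_subsets=30, seed=0):
    """Train on random subsets; report (iterations, error on the full data)."""
    rng = random.Random(seed)
    k = max(2, int(len(data) * fraction))
    results = []
    for _ in range(n_subsets):
        a, b, iters = fit_gd(rng.sample(data, k))
        results.append((iters, mse(a, b, data)))
    return results

# Synthetic stand-in for "the big set" (hypothetical).
rng = random.Random(1)
data = []
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x - 1.0 + rng.gauss(0.0, 0.3)))

a_full, b_full, iters_full = fit_gd(data)
full_err = mse(a_full, b_full, data)
print("full data:", iters_full, "iterations, error", round(full_err, 5))
for fraction in (0.02, 0.05):
    res = subset_experiment(data, fraction)
    errs = [e for _, e in res]
    print(fraction, "median iterations:", statistics.median(i for i, _ in res),
          "error mean/sd:", round(statistics.mean(errs), 5), round(statistics.stdev(errs), 5))
```

If the subset errors cluster tightly around the full-data error, the small subsets are already telling you most of what the big set would.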
This is a variation on particle swarm optimization (see reference) modeled on how honeybees make decisions based on scouting.
You asked what happens if weights do not converge.
Neural Networks are one tool. They are not the only tool. There are others. I would look at using one of them.
I work in terms of information criteria, so I look at both the weights (parameter count) and the error. You might try one of those.
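For concreteness, here is a minimal sketch of two common information criteria, AIC and BIC, in their least-squares form. It assumes Gaussian errors and drops additive constants, so only differences between models fitted to the same data are meaningful. The two example models and their numbers are made up for illustration.

```python
import math

def aic(n, rss, k):
    """Akaike information criterion for n samples, residual sum of squares rss, k parameters."""
    return n * math.log(rss / n) + 2 * k

def bic(n, rss, k):
    """Bayesian information criterion; penalizes parameters more heavily for large n."""
    return n * math.log(rss / n) + k * math.log(n)

n = 200
small = {"k": 21, "rss": 52.0}   # hypothetical small network
large = {"k": 201, "rss": 50.0}  # hypothetical large network, slightly lower error

for name, m in (("small", small), ("large", large)):
    print(name, round(aic(n, m["rss"], m["k"]), 2), round(bic(n, m["rss"], m["k"]), 2))
```

Lower is better for both criteria. In this made-up case the larger model's small gain in error does not pay for its extra parameters.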
There are some types of preprocessing that can be useful. Center and scale. Rotate using principal components. If you look at the eigenvalues of your principal components you can use scree plot rules to estimate the dimension of your data. Reducing the dimension can improve convergence. If you know something about the 'underlying physics' then you can smooth or filter the data to remove noise. Sometimes convergence is about noise in the system.
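The centering, scaling, and eigenvalue steps can be sketched as below. To stay dependency-free, this uses a hand-written Jacobi eigenvalue iteration where one would normally call a linear algebra library; that choice, the synthetic three-column data set, and all function names are assumptions for illustration. The third column is almost exactly the sum of the first two, so the data are effectively two-dimensional.

```python
import math
import random

def standardize(rows):
    """Center each column to mean 0 and scale it to unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def jacobi_eigenvalues(matrix, sweeps=50, eps=1e-12):
    """Eigenvalues of a symmetric matrix via Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    d = len(a)
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(d) for q in range(p + 1, d))
        if off < eps:
            break
        for p in range(d):
            for q in range(p + 1, d):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                apq = a[p][q]
                a[p][p] -= t * apq
                a[q][q] += t * apq
                a[p][q] = a[q][p] = 0.0
                for r in range(d):
                    if r != p and r != q:
                        arp, arq = a[r][p], a[r][q]
                        a[r][p] = a[p][r] = c * arp - s * arq
                        a[r][q] = a[q][r] = s * arp + c * arq
    return sorted((a[i][i] for i in range(d)), reverse=True)

rng = random.Random(0)
rows = []
for _ in range(200):
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 5.0)
    rows.append([x1, x2, x1 + x2 + rng.gauss(0.0, 0.01)])  # third column is redundant

eigs = jacobi_eigenvalues(covariance(standardize(rows)))
ratios = [e / sum(eigs) for e in eigs]  # the values a scree plot would show
print([round(e, 4) for e in eigs], [round(r, 4) for r in ratios])
```

A sharp drop toward zero after the second eigenvalue is the scree-plot signal that two components carry essentially all the variance.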
I find the idea of Compressed sensing to be interesting. It can allow radical sub-sampling of some systems without loss of generalization. I would look at some bootstrap re-sampled statistics and distributions of your data to determine if and at what level of sub-sampling the training set becomes representative. This gives you some measure of the "health" of your data.
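One way to run the bootstrap check described above is sketched here. It resamples the data at several subsample sizes and watches how the spread of a statistic (the mean, in this sketch) shrinks as the size grows. The synthetic `values`, the chosen sizes, and the helper name `bootstrap_spread` are assumptions for illustration.

```python
import random
import statistics

def bootstrap_spread(values, sample_size, n_boot=300, seed=0):
    """Std. dev. of the bootstrap means for resamples of the given size."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.choices(values, k=sample_size))
             for _ in range(n_boot)]
    return statistics.stdev(means)

rng = random.Random(7)
values = [rng.gauss(10.0, 2.0) for _ in range(5000)]  # stand-in for one feature

for size in (10, 50, 250, 1000):
    print(size, round(bootstrap_spread(values, size), 3))
```

The subsample size at which the spread becomes small relative to the effect you care about is a rough indication of when a training subset starts to be representative.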
Sometimes it is a good thing that they not converge.
Have you ever heard of a voting paradox? You might think of it as a higher-count cousin to a two-way impasse. It is a loop. In a 2-person voting paradox the first person wants candidate "A" while the second wants candidate "B" (or not-A or such). The important part is that you can think of it as a loop.
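The loop is easiest to see in the classic three-voter version of the paradox (the Condorcet cycle), which is assumed here to be the kind of example meant. Each voter has a perfectly consistent ranking, yet the majority preference goes around in a circle.

```python
# Each ballot ranks the candidates from most to least preferred.
ballots = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def majority_prefers(x, y, ballots):
    """True if more than half of the ballots rank x above y."""
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return wins > len(ballots) / 2

for x, y in (("A", "B"), ("B", "C"), ("C", "A")):
    print(x, "beats", y, ":", majority_prefers(x, y, ballots))
```

All three pairwise comparisons come out true, so there is no stable "winner" to settle on, only a cycle.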
Loops are important in neural networks. Feedback. Recursion. It made the perceptron able to resolve XOR-like problems. It makes loops, and sometimes the loops can act like the voting paradox, where they will keep changing weights if you had infinite iterations. They aren't meant to converge because it isn't the individual weight that matters but the interaction of the weights in the loop.
Note:
Using only 500 iterations can be a problem. I have had NNs where 10,000 iterations was barely enough. How many iterations are "enough" depends, as I have already indicated, on the data, NN topology, node transfer functions, learning/training function, and even computer hardware. You have to have a good understanding of how they all interact with your iteration count before saying that there have been "enough" or "too many" iterations. Other considerations like time, budget, and what you want to do with the NN when you are done training it should also be taken into account.
Chen, R. B., Chang, S. P., Wang, W., & Wong, W. K. (2011, September). Optimal Experimental Designs via Particle Swarm Optimization Methods (preprint). Retrieved March 25, 2012, from http://www.math.ntu.edu.tw/~mathlib/preprint/2011-03.pdf
Let's start with a triviality: a deep neural network is simply a feedforward network with many hidden layers.
This is more or less all there is to say about the definition. Neural networks can be recurrent or feedforward; feedforward ones do not have any loops in their graph and can be organized in layers. If there are "many" layers, then we say that the network is deep.
How many layers does a network have to have in order to qualify as deep? There is no definite answer to this (it's a bit like asking how many grains make a heap), but usually having two or more hidden layers counts as deep. In contrast, a network with only a single hidden layer is conventionally called "shallow". I suspect that there will be some inflation going on here, and in ten years people might think that anything with fewer than, say, ten layers is shallow and suitable only for kindergarten exercises. Informally, "deep" suggests that the network is tough to handle.
Here is an illustration, adapted from here:
![Deep vs non-deep neural network](https://i.stack.imgur.com/OH3gI.png)
But the real question you are asking is, of course, Why would having many layers be beneficial?
I think that the somewhat astonishing answer is that nobody really knows. There are some common explanations that I will briefly review below, but none of them has been convincingly demonstrated to be true, and one cannot even be sure that having many layers is really beneficial.
I say that this is astonishing, because deep learning is massively popular, is breaking all the records (from image recognition, to playing Go, to automatic translation, etc.) every year, is getting used by the industry, etc. etc. And we are still not quite sure why it works so well.
I base my discussion on the Deep Learning book by Goodfellow, Bengio, and Courville, which was published in 2016 and is widely considered to be the book on deep learning. (It's freely available online.) The relevant section is 6.4.1 Universal Approximation Properties and Depth.
You wrote that
10 years ago in class I learned that having several layers or one layer (not counting the input and output layers) was equivalent in terms of the functions a neural network is able to represent [...]
You must be referring to the so-called universal approximation theorem, proved by Cybenko in 1989 and generalized by various people in the 1990s. It basically says that a shallow neural network (with 1 hidden layer) can approximate any continuous function to arbitrary accuracy, given enough hidden units, i.e. it can in principle learn anything. This is true for various nonlinear activation functions, including the rectified linear units that most neural networks are using today (the textbook references Leshno et al. 1993 for this result).
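Stated a little more formally, in the form usually attributed to Cybenko (the notation here is mine, not the textbook's): let $\sigma$ be a continuous sigmoidal activation and $f$ a continuous function on the unit cube $[0,1]^n$. Then for every $\varepsilon > 0$ there exist a width $N$ and parameters $a_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^n$ such that

$$\sup_{x \in [0,1]^n} \left| f(x) - \sum_{i=1}^{N} a_i \, \sigma\!\left(w_i^\top x + b_i\right) \right| < \varepsilon .$$

Note what the statement leaves out: it gives no bound on how large $N$ has to be, and it says nothing about whether a training algorithm will actually find these parameters.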
If so, then why is everybody using deep nets?
Well, a naive answer is that they work better. Here is a figure from the Deep Learning book showing that it helps to have more layers in one particular task, but the same phenomenon is often observed across various tasks and domains:
![More layers is good](https://i.stack.imgur.com/trj4L.png)
We know that a shallow network could perform as well as the deeper ones. But in this example it does not, and shallow networks usually do not. The question is: why? Possible answers:
- Maybe a shallow network would need more neurons than the deep one?
- Maybe a shallow network is more difficult to train with our current algorithms (e.g. it has more nasty local minima, or the convergence rate is slower, or whatever)?
- Maybe a shallow architecture does not fit the kind of problems we are usually trying to solve (e.g. object recognition is a quintessentially "deep", hierarchical process)?
- Something else?
The Deep Learning book argues for bullet points #1 and #3. First, it argues that the number of units in a shallow network grows exponentially with task complexity. So in order to be useful a shallow network might need to be very big; possibly much bigger than a deep network. This is based on a number of papers proving that shallow networks would in some cases need exponentially many neurons; but whether e.g. MNIST classification or Go playing are such cases is not really clear. Second, the book says this:
Choosing a deep model encodes a very general belief that the function we
want to learn should involve composition of several simpler functions. This can be
interpreted from a representation learning point of view as saying that we believe
the learning problem consists of discovering a set of underlying factors of variation
that can in turn be described in terms of other, simpler underlying factors of
variation.
I think the current "consensus" is that it's a combination of bullet points #1 and #3: for real-world tasks deep architectures are often beneficial, and shallow architectures would be inefficient and require a lot more neurons for the same performance.
But it's far from proven. Consider e.g. Zagoruyko and Komodakis, 2016, Wide Residual Networks. Residual networks with 150+ layers appeared in 2015 and won various image recognition contests. This was a big success and looked like a compelling argument in favour of deepness; here is one figure from a presentation by the first author on the residual network paper (note that the time confusingly goes to the left here):
![deep residual networks](https://i.stack.imgur.com/iVURh.png)
But the paper linked above shows that a "wide" residual network with "only" 16 layers can outperform "deep" ones with 150+ layers. If this is true, then the whole point of the above figure breaks down.
Or consider Ba and Caruana, 2014, Do Deep Nets Really Need to be Deep?:
In this paper we provide empirical evidence that shallow nets are capable of learning the same
function as deep nets, and in some cases with the same number of parameters as the deep nets. We
do this by first training a state-of-the-art deep model, and then training a shallow model to mimic the
deep model. The mimic model is trained using the model compression scheme described in the next
section. Remarkably, with model compression we are able to train shallow nets to be as accurate
as some deep models, even though we are not able to train these shallow nets to be as accurate as
the deep nets when the shallow nets are trained directly on the original labeled training data. If a
shallow net with the same number of parameters as a deep net can learn to mimic a deep net with
high fidelity, then it is clear that the function learned by that deep net does not really have to be deep.
If true, this would mean that the correct explanation is rather my bullet #2, and not #1 or #3.
As I said --- nobody really knows for sure yet.
Concluding remarks
The amount of progress achieved in deep learning over the last ~10 years is truly amazing, but most of this progress was achieved by trial and error, and we still lack a very basic understanding of what exactly makes deep nets work so well. Even the list of things that people consider to be crucial for setting up an effective deep network seems to change every couple of years.
The deep learning renaissance started in 2006 when Geoffrey Hinton (who had been working on neural networks for 20+ years without much interest from anybody) published a couple of breakthrough papers offering an effective way to train deep networks (Science paper, Neural computation paper). The trick was to use unsupervised pre-training before starting the gradient descent. These papers revolutionized the field, and for a couple of years people thought that unsupervised pre-training was the key.
Then in 2010 Martens showed that deep neural networks can be trained with second-order methods (so-called Hessian-free methods) and can outperform networks trained with pre-training: Deep learning via Hessian-free optimization. Then in 2013 Sutskever et al. showed that stochastic gradient descent with some very clever tricks can outperform Hessian-free methods: On the importance of initialization and momentum in deep learning. Also, around 2010 people realized that using rectified linear units instead of sigmoid units makes a huge difference for gradient descent. Dropout appeared in 2014. Residual networks appeared in 2015. People keep coming up with more and more effective ways to train deep networks and what seemed like a key insight 10 years ago is often considered a nuisance today. All of that is largely driven by trial and error and there is little understanding of what makes some things work so well and some other things not. Training deep networks is like a big bag of tricks. Successful tricks are usually rationalized post factum.
We don't even know why deep networks reach a performance plateau; just 10 years ago people used to blame local minima, but the current thinking is that this is not the point (when the performance plateaus, the gradients tend to stay large). This is such a basic question about deep networks, and we don't even know this.
Update: This is more or less the subject of Ali Rahimi's NIPS 2017 talk on machine learning as alchemy: https://www.youtube.com/watch?v=Qi1Yry33TQE.
[This answer was entirely re-written in April 2017, so some of the comments below do not apply anymore.]
To assess the complexity of a model, the number of free parameters is a good start; from it you can calculate AIC or BIC. How to get the number of free parameters in a multilayer perceptron (MLP) neural network is covered here: Number of parameters in an artificial neural network for AIC
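For a fully connected MLP the count is simple enough to sketch directly: each layer contributes one weight per input-output pair plus, if biases are used, one bias per output unit. The function name and the example layer sizes below are illustrative assumptions, and the count ignores any regularization or weight sharing.

```python
def mlp_parameter_count(layer_sizes, bias=True):
    """Free parameters of a fully connected MLP, e.g. [inputs, hidden..., outputs]."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out          # weights between consecutive layers
        if bias:
            total += n_out             # one bias per unit in the receiving layer
    return total

print(mlp_parameter_count([4, 8, 3]))      # 4*8 + 8 + 8*3 + 3 = 67
print(mlp_parameter_count([10, 5, 5, 1]))  # 55 + 30 + 6 = 91
```

This number is the parameter count that would go into an AIC or BIC calculation.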
In addition, there are cases where you have a lot of parameters, but they are not "totally free" because of regularization. For example, in linear regression, if you have $1000$ features but only $500$ data points, it is totally OK to fit a model with $1000$ coefficients as long as you regularize the coefficients with a large regularization parameter. You can search for ridge regression or lasso regression for details.
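To see why this works for ridge regression, write $X$ for the $500 \times 1000$ design matrix, $y$ for the responses, and $\lambda > 0$ for the regularization parameter (this notation is mine). The ridge estimate is

$$\hat\beta_\lambda = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 = \left(X^\top X + \lambda I\right)^{-1} X^\top y .$$

With more features than data points, $X^\top X$ has rank at most $500$ and cannot be inverted on its own, but $X^\top X + \lambda I$ is positive definite for any $\lambda > 0$, so the solution exists and is unique. The penalty is what keeps the $1000$ coefficients from being "totally free".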
In the neural network case, it is also possible to have a very complicated network structure (many layers, many neurons) with some regularization applied. In that case, the parameter-counting method mentioned above will not work.
Finally, I would not agree with your statement about random forests. As discussed in Breiman's original paper, increasing the number of trees will not lead to a more complex model or to overfitting. Instead, the out-of-bag (OOB) error will converge if you have a large number of trees. In practice, if computational power is not a concern, building a random forest with a large number of trees is actually recommended.
To your comment:
Model complexity is an abstract concept and can be defined in different ways. AIC and BIC are two such definitions, and others exist. See Definition of model complexity in XGBoost as an example.
In addition, it is fine for two NNs to have different structures and still have the same complexity. Here is an example: say we are doing polynomial regression. You have two options: one is a higher-order model with more regularization, the other is a lower-order model without regularization. They can have the same "complexity" even though their structures are different.
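One common way to put a number on this for ridge-type (L2) regularization, offered here as an illustration rather than as the definition intended above, is the effective degrees of freedom. With $X$ the matrix of polynomial features ($p$ columns, full column rank), $d_1, \dots, d_p$ its singular values, and $\lambda \ge 0$ the regularization parameter (notation mine):

$$\mathrm{df}(\lambda) = \operatorname{tr}\!\left[ X \left(X^\top X + \lambda I\right)^{-1} X^\top \right] = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda} .$$

At $\lambda = 0$ this equals $p$, the plain parameter count, and it shrinks toward $0$ as $\lambda$ grows. So a high-order polynomial with a large $\lambda$ can have the same $\mathrm{df}$ as an unregularized low-order one, even though the two models look structurally different.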