There are a number of questions to ask:
- Do you have the appropriate number of neurons in each layer?
- Are you using the appropriate types of transfer functions?
- Are you using the appropriate type of learning algorithm?
- Do you have a large enough sample size?
- Can you confirm that your samples have the right sorts of relationships with each other to be informative? (not redundant, of relevant dimension, etc.)
What can you give in the way of specifics? Can you tell us something about the nature of the data?
You could make a gradient boosted tree of neural networks.
You asked what happens if you stop early.
You can try it yourself. Run 300 trials where you start with randomly initialized weights, and then stop at a specified number of iterations, let's say 100. At that point compute your ensemble error, your training-subset error, and your test-set error. Repeat. After you have 300 error values, you have an idea of the error distribution given 100 learning iterations. If you like, you can then sample that distribution at several other iteration counts. I suggest 200, 500, and 1000 iterations. This will give you an idea how your SNR changes over time. A plot of SNR vs. iteration count can give you an idea about "cliffs" or "good enough" points. Sometimes there are cliffs where the error collapses. Sometimes the error is already acceptable at that point.
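Here is a minimal sketch of that experiment, assuming scikit-learn and a stand-in regression dataset; the layer size is a placeholder and the text's 300 restarts are cut to 30 to keep the runtime short:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Stand-in data; substitute your own training and test sets.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_iter in (100, 200, 500, 1000):        # the iteration counts suggested above
    test_errors = []
    for seed in range(30):                  # 300 in the text; 30 here for speed
        net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=n_iter,
                           random_state=seed)   # fresh random weights each run
        net.fit(X_train, y_train)
        test_errors.append(mean_squared_error(y_test, net.predict(X_test)))
    # The spread of these values approximates the error distribution at this
    # iteration count; plot mean/std vs. n_iter to look for cliffs.
    print(n_iter, np.mean(test_errors), np.std(test_errors))
```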
It takes "relatively simple" data or "pretty good" luck for your system to consistently converge in under 100 iterations, and neither of those is repeatable or generalizable.
Why are you thinking in terms of weights converging rather than the error being below a particular threshold? Have you ever heard of a voting paradox? (link) When you have cyclic interactions in your system (like feedback in neural networks) you can have voting paradoxes - coupled changes. I don't know whether the weights alone are a sufficient indicator of convergence of the network.
You can think of the weights as a space. It has more than 3 dimensions, but it is still a space. In the "centroid" of that space is your "best fit" region. Far from the centroid is a less good fit. You can think of the current setting of your weights as a single point in that space.
Now you don't know where the "good" actually is. What you do have is a local "slope". You can perform gradient descent toward local "better" given where your point is right now. It doesn't tell you the "universal" better, but local is better than nothing.
So you start iterating, walking downhill toward that valley of betterness. You iterate until you think you are done. Maybe the values of your weights are large. Maybe they are bouncing all over the place. Maybe the computation is "taking too long". You want to be done.
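To make that picture concrete, here is a toy descent on a two-dimensional "weight space" with a quadratic stand-in for the error surface (the surface, starting point, and step size are all invented for illustration):

```python
import numpy as np

def error(w):        # a bowl whose "centroid" (best fit) sits at (3, -2)
    return (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2

def gradient(w):     # the local "slope" at the current point
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 2.0)])

w = np.array([10.0, 10.0])     # the current setting of the weights: one point
for _ in range(100):           # iterate, walking downhill
    w -= 0.1 * gradient(w)     # step toward the local "better"

print(w, error(w))             # near (3, -2), with error near zero
```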
So how do you know whether where you are is "good enough"?
Here is a quick test that you could do:
Take 30 uniform random subsets of the data (a few percent of the data each) and retrain the network on each of them. Training should be much faster. Observe how long they take to converge and compare that with the convergence history of the full set. Test each subset-trained network's error on the entire data set and see how that distribution of errors compares to your big error. Now bump the subset sizes up to maybe 5% of your data and repeat. See what this teaches you.
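A hedged sketch of that test with scikit-learn; the dataset here is a stand-in for your own, and the layer size and subset fractions are placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=2000, n_features=10, noise=5.0,
                       random_state=0)    # stand-in for your full data set

def subset_errors(fraction, n_subsets=30):
    iterations, errors = [], []
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=int(fraction * len(X)), replace=False)
        net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000)
        net.fit(X[idx], y[idx])           # retrain on the small subset
        iterations.append(net.n_iter_)    # how long this run took to converge
        errors.append(mean_squared_error(y, net.predict(X)))  # error on ALL data
    return np.array(iterations), np.array(errors)

for frac in (0.02, 0.05):                 # a few percent, then bump to 5%
    iters, errs = subset_errors(frac)
    print(frac, iters.mean(), errs.mean(), errs.std())
```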
This test is a variation on particle swarm optimization (see the Chen et al. reference below), modeled on how honeybees make decisions based on scouting.
You asked what happens if weights do not converge.
Neural networks are one tool, but they are not the only tool; there are others, and I would look at using one of them.
I work in terms of information criteria (such as AIC or BIC), so I look at both the weights (parameter count) and the error. You might try one of those.
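For instance, under a Gaussian-error assumption the AIC can be computed from the residual sum of squares and the weight count; this is a generic sketch, not tied to any particular library:

```python
import numpy as np

def aic_gaussian(y_true, y_pred, k):
    """AIC assuming Gaussian errors; k is the number of free parameters
    (for an NN: all weights and biases). Lower is better."""
    n = len(y_true)
    rss = np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    log_likelihood = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)
    return 2 * k - 2 * log_likelihood

# With a scikit-learn MLP, the parameter count would be something like:
# k = sum(w.size for w in net.coefs_) + sum(b.size for b in net.intercepts_)
```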
There are some types of preprocessing that can be useful. Center and scale. Rotate using principal components. If you look at the eigenvalues of your principal components, you can use scree-plot rules to estimate the dimension of your data. Reducing the dimension can improve convergence. If you know something about the 'underlying physics', then you can smooth or filter the data to remove noise. Sometimes convergence is about noise in the system.
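A minimal sketch of that preprocessing chain with scikit-learn; the data matrix and the retained component count are placeholders:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(200, 10)                    # stand-in for your data
X_scaled = StandardScaler().fit_transform(X)   # center and scale
pca = PCA().fit(X_scaled)                      # rotate using principal components
print(pca.explained_variance_)                 # eigenvalues: look for the scree "elbow"
X_reduced = PCA(n_components=5).fit_transform(X_scaled)  # keep the leading dimensions
```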
I find the idea of Compressed sensing to be interesting. It can allow radical sub-sampling of some systems without loss of generalization. I would look at some bootstrap re-sampled statistics and distributions of your data to determine if and at what level of sub-sampling the training set becomes representative. This gives you some measure of the "health" of your data.
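One hedged way to run that health check: bootstrap a summary statistic at several subset sizes and watch for the size at which its distribution settles near the full-data value. The statistic and fractions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))       # stand-in for your data

full_mean = X.mean(axis=0)
for frac in (0.01, 0.05, 0.10, 0.25):
    size = int(frac * len(X))
    deviations = []
    for _ in range(200):             # bootstrap resamples at this subset size
        idx = rng.choice(len(X), size=size, replace=True)
        deviations.append(np.linalg.norm(X[idx].mean(axis=0) - full_mean))
    # When the deviations become small and stable, subsets of this size
    # are representative of the whole.
    print(frac, np.mean(deviations), np.std(deviations))
```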
Sometimes it is a good thing that they do not converge.
Have you ever heard of a voting paradox? You might think of it as a higher-count cousin to a two-way impasse. It is a loop. In the two-person impasse the first person wants candidate "A" while the second wants candidate "B" (or not-A, or the like). The important part is that you can think of it as a loop.
Loops are important in neural networks. Feedback. Recursion. They are what made the multilayer perceptron able to resolve XOR-like problems. The network forms loops, and sometimes the loops can act like the voting paradox, where they would keep changing the weights even given infinite iterations. They aren't meant to converge, because it isn't the individual weight that matters but the interaction of the weights in the loop.
Note:
Using only 500 iterations can be a problem. I have had NNs where 10,000 iterations was barely enough. The number of iterations that is "enough" depends, as I have already indicated, on the data, the NN topology, the node transfer functions, the learning/training function, and even the computer hardware. You have to have a good understanding of how they all interact with your iteration count before saying that there have been "enough" or "too many" iterations. Other considerations, like time, budget, and what you want to do with the NN when you are done training it, should also be weighed.
Chen, R. B., Chang, S. P., Wang, W., & Wong, W. K. (2011, September). Optimal experimental designs via particle swarm optimization methods (preprint). Retrieved March 25, 2012, from http://www.math.ntu.edu.tw/~mathlib/preprint/2011-03.pdf
There's an excellent writeup on this question (and on the question of 'how many hidden layers?' as well) at https://stackoverflow.com/questions/10565868/what-is-the-criteria-for-choosing-number-of-hidden-layers-and-nodes-in-hidden-la . It may be disappointing to find that there are few hard-and-fast rules, and those that do exist are often mathematically or logically suspect. Another answer in that thread also referenced this webpage: ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hu .
Alternatively, depending on how computationally intensive it is to train your network, you can use various optimization algorithms to search for a good layer size.
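For example, with scikit-learn a plain grid search over layer sizes looks roughly like this (random search or Bayesian optimization are common swaps when each training run is expensive; the dataset and grid here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

param_grid = {"hidden_layer_sizes": [(5,), (10,), (20,), (10, 10)]}
search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0),
                      param_grid, cv=3)        # 3-fold cross-validated search
search.fit(X, y)
print(search.best_params_, search.best_score_)
```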
As for the more general question of whether or not layer size should stay constant, I would suggest considering that as a dimensionality-reduction procedure. Would you want your data to be compressed into a lower dimensional form and lose some information? This can be a positive or negative thing. For image compression, it's a requirement. See http://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/Applications/imagecompression.html for references on 'bottleneck' layers with image compression.
The type of problem which I would want to have big -> small -> big or some variety of that would probably involve a high dimensional source of data which I would like to compress and then learn features from. If you think that this describes your problem, then perhaps it is a valid approach to use more hidden units, feed into fewer units, then expand the layer out again.
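A hedged sketch of such a big -> small -> big network, pressing scikit-learn's MLPRegressor into service as an autoencoder: it is trained to reproduce its own input, so the narrow middle layer must learn a compressed code. The sizes are invented:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(500, 64)          # stand-in high-dimensional data

autoencoder = MLPRegressor(hidden_layer_sizes=(128, 32, 128),  # bottleneck at 32
                           max_iter=2000, random_state=0)
autoencoder.fit(X, X)                # target = input: learn to reconstruct
print(autoencoder.score(X, X))       # R^2 of the reconstruction
```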
Best Answer
Three sentence version:
Each layer can apply any function you want to the previous layer (usually a linear transformation followed by a squashing nonlinearity).
The hidden layers' job is to transform the inputs into something that the output layer can use.
The output layer transforms the hidden layer activations into whatever scale you wanted your output to be on.
Like you're 5:
If you want a computer to tell you if there's a bus in a picture, the computer might have an easier time if it had the right tools.
So your bus detector might be made of a wheel detector (to help tell you it's a vehicle) and a box detector (since the bus is shaped like a big box) and a size detector (to tell you it's too big to be a car). These are the three elements of your hidden layer: they're not part of the raw image, they're tools you designed to help you identify buses.
If all three of those detectors turn on (or perhaps if they're especially active), then there's a good chance you have a bus in front of you.
Neural nets are useful because there are good tools (like backpropagation) for building lots of detectors and putting them together.
Like you're an adult:
A feed-forward neural network applies a series of functions to the data. The exact functions will depend on the neural network you're using: most frequently, these functions each compute a linear transformation of the previous layer, followed by a squashing nonlinearity. Sometimes the functions will do something else (like computing logical functions in your examples, or averaging over adjacent pixels in an image). So the roles of the different layers could depend on what functions are being computed, but I'll try to be very general.
Let's call the input vector $x$, the hidden layer activations $h$, and the output activation $y$. You have some function $f$ that maps from $x$ to $h$ and another function $g$ that maps from $h$ to $y$.
So the hidden layer's activation is $f(x)$ and the output of the network is $g(f(x))$.
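In code, with the usual choice of a linear transformation followed by a squashing nonlinearity at each layer (the sizes and random weights here are arbitrary):

```python
import numpy as np

def sigmoid(z):                      # the "squashing" nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # hidden-layer parameters
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output-layer parameters

def f(x):                            # input -> hidden
    return sigmoid(W1 @ x + b1)

def g(h):                            # hidden -> output
    return sigmoid(W2 @ h + b2)

x = np.array([0.5, -1.0])
print(f(x))                          # the hidden layer's activation, f(x)
print(g(f(x)))                       # the network's output, g(f(x))
```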
Why have two functions ($f$ and $g$) instead of just one?
If the level of complexity per function is limited, then $g(f(x))$ can compute things that $f$ and $g$ can't do individually.
An example with logical functions:
For example, if we only allow $f$ and $g$ to be simple logical operators like "AND", "OR", and "NAND", then you can't compute other functions like "XOR" with just one of them. On the other hand, we could compute "XOR" if we were willing to layer these functions on top of each other:
First layer functions:
$h_1 = \text{OR}(x_1, x_2)$ and $h_2 = \text{NAND}(x_1, x_2)$
Second layer function:
$y = \text{AND}(h_1, h_2)$, which equals $\text{XOR}(x_1, x_2)$
The network's output is just the result of this second function. The first layer transforms the inputs into something that the second layer can use so that the whole network can perform XOR.
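A quick check of that layered construction in plain Python, with booleans standing in for neurons:

```python
def OR(a, b):   return a or b
def NAND(a, b): return not (a and b)
def AND(a, b):  return a and b

def xor(x1, x2):
    h1, h2 = OR(x1, x2), NAND(x1, x2)   # the first-layer functions
    return AND(h1, h2)                  # the second-layer function

for x1 in (False, True):
    for x2 in (False, True):
        print(x1, x2, xor(x1, x2))      # reproduces XOR's truth table
```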
An example with images:
Slide 61 from this talk--also available here as a single image--shows (one way to visualize) what the different hidden layers in a particular neural network are looking for.
The first layer looks for short pieces of edges in the image: these are very easy to find from raw pixel data, but they're not very useful by themselves for telling you if you're looking at a face or a bus or an elephant.
The next layer composes the edges: if the edges from the bottom hidden layer fit together in a certain way, then one of the eye-detectors in the middle of left-most column might turn on. It would be hard to make a single layer that was so good at finding something so specific from the raw pixels: eye detectors are much easier to build out of edge detectors than out of raw pixels.
The next layer up composes the eye detectors and the nose detectors into faces. In other words, these will light up when the eye detectors and nose detectors from the previous layer turn on with the right patterns. These are very good at looking for particular kinds of faces: if one or more of them lights up, then your output layer should report that a face is present.
This is useful because face detectors are easy to build out of eye detectors and nose detectors, but really hard to build out of pixel intensities.
So each layer gets you farther and farther from the raw pixels and closer to your ultimate goal (e.g. face detection or bus detection).
Answers to assorted other questions
"Why are some layers in the input layer connected to the hidden layer and some are not?"
The disconnected nodes in the network are called "bias" nodes. There's a really nice explanation here. The short answer is that they're like intercept terms in regression.
"Where do the "eye detector" pictures in the image example come from?"
I haven't double-checked the specific images I linked to, but in general, these visualizations show the set of pixels in the input layer that maximize the activity of the corresponding neuron. So if we think of the neuron as an eye detector, this is the image that the neuron considers to be most eye-like. Folks usually find these pixel sets with an optimization (hill-climbing) procedure.
In this paper by some Google folks with one of the world's largest neural nets, they show a "face detector" neuron and a "cat detector" neuron this way, as well as in a second way: they also show the actual images that activate the neuron most strongly (figure 3, figure 16). The second approach is nice because it shows how flexible and nonlinear the network is--these high-level "detectors" are sensitive to all of these images, even though the images don't particularly look similar at the pixel level.
Let me know if anything here is unclear or if you have any more questions.