There are a number of questions to ask:
- do you have the appropriate number of neurons in each layer?
- are you using the appropriate types of transfer functions?
- are you using the appropriate type of learning algorithm?
- do you have a large enough sample size?
- can you confirm that your samples have the right sort of relationship with each other to be informative (not redundant, of relevant dimension, etc.)?
What can you give in the way of specifics? Can you tell us something about the nature of the data?
You could make a gradient boosted tree of neural networks.
You asked what happens if you stop early.
You can try this yourself. Run 300 trials where you start with randomly initialized weights and stop at a specified number of iterations, let's say 100. At that point compute your ensemble error, your training-subset error, and your test-set error. Repeat. After you have 300 values telling you what the error is, you have an idea of your error distribution given 100 learning iterations. If you like, you can then sample that distribution at several other iteration counts; I suggest 200, 500, and 1000. This will give you an idea of how your signal-to-noise ratio (SNR) changes with iteration count. A plot of SNR vs. iteration count can show you "cliffs" or "good enough" points. Sometimes there are cliffs where the error collapses. Sometimes the error is already acceptable at that point.
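Here is a minimal sketch of that restart experiment, using scikit-learn's `MLPRegressor` as a stand-in for your network and synthetic data as a placeholder. It only tracks the test-set error piece, and the network size, restart count, and iteration budgets are illustrative assumptions, not a recipe.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                      # placeholder data
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def error_distribution(n_iter, n_restarts=300):
    """Train n_restarts networks from random initial weights, each stopped
    after n_iter iterations, and return the test-set errors."""
    errors = []
    for seed in range(n_restarts):
        net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=n_iter,
                           random_state=seed)
        net.fit(X_train, y_train)
        errors.append(np.mean((net.predict(X_test) - y_test) ** 2))
    return np.array(errors)

# Sample the error distribution at several iteration budgets and watch how
# the center and spread (your SNR) change with the iteration count.
for n_iter in (100, 200, 500, 1000):
    e = error_distribution(n_iter, n_restarts=30)   # 30 restarts to keep it quick
    print(n_iter, e.mean(), e.std())
```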
It takes "relatively simple" data or "pretty good" luck for your system to consistently converge in under 100 iterations. Neither of those is about repeatability, and neither generalizes.
Why are you thinking in terms of weights converging rather than error falling below a particular threshold? Have you ever heard of a voting paradox? (link) When you have cyclic interactions in your system (like feedback in neural networks) you can get voting paradoxes - coupled changes. I don't know whether weights alone are a sufficient indicator of convergence of the network.
You can think of the weights as a space. It has more than 3 dimensions, but it is still a space. In the "centroid" of that space is your "best fit" region. Far from the centroid is a less good fit. You can think of the current setting of your weights as a single point in that space.
Now you don't know where the "good" actually is. What you do have is a local "slope". You can perform gradient descent toward local "better" given where your point is right now. It doesn't tell you the "universal" better, but local is better than nothing.
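Here is a tiny sketch of that "walk downhill" idea: plain gradient descent on a toy quadratic error surface standing in for the weight space. The loss, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

BEST = np.array([1.0, -2.0, 0.5])   # toy location of the "best fit" region

def loss(w):
    return np.sum((w - BEST) ** 2)  # how far the current point is from "good"

def grad(w):
    return 2.0 * (w - BEST)         # the local "slope" at the current point

w = np.array([10.0, 10.0, 10.0])    # current setting of the weights: one point
step = 0.1
for _ in range(100):
    w = w - step * grad(w)          # step toward the locally "better" direction
print(w, loss(w))
```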
So you start iterating, walking downhill toward that valley of betterness. You iterate until you think you are done. Maybe the values of your weights are large. Maybe they are bouncing all over the place. Maybe the computation is "taking too long". You want to be done.
So how do you know whether where you are is "good enough"?
Here is a quick test that you could do:
Take 30 uniform random subsets of the data (like a few percent of the data each) and retrain the network on them. It should be much faster. Observe how long it takes them to converge and compare it with the convergence history of the big set. Test the error of the network for the entire data on these subsets and see how that distribution of errors compares to your big error. Now bump the subset sizes up to maybe 5% of your data and repeat. See what this teaches you.
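One way that subset test might look in code, again with `MLPRegressor` and synthetic data as stand-ins; the subset fractions, network size, and iteration cap are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 8))                      # placeholder for the big set
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=5000)

def subset_test(fraction, n_subsets=30):
    iters, errors = [], []
    for seed in range(n_subsets):
        idx = rng.choice(len(X), size=int(fraction * len(X)), replace=False)
        net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                           random_state=seed)
        net.fit(X[idx], y[idx])                     # retrain on the small subset
        iters.append(net.n_iter_)                   # how long it took to converge
        errors.append(np.mean((net.predict(X) - y) ** 2))  # error on ALL the data
    return np.array(iters), np.array(errors)

for fraction in (0.02, 0.05):                       # a few percent, then 5%
    iters, errors = subset_test(fraction)
    print(fraction, iters.mean(), errors.mean(), errors.std())
```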
This is a variation on particle swarm optimization (see the reference below), modeled on how honeybees make decisions based on scouting.
You asked what happens if weights do not converge.
Neural Networks are one tool. They are not the only tool. There are others. I would look at using one of them.
I work in terms of information criteria, so I look at both the weights (parameter count) and the error. You might try one of those.
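If information criteria are new to you, here is a small sketch of the idea under a Gaussian-residual assumption: penalize the error by the parameter count. The AIC variant shown and the parameter-counting helper (which assumes a fitted scikit-learn MLP) are my choices, not a prescription.

```python
import numpy as np

def aic(residuals, n_parameters):
    """AIC for least-squares fits: 2k + n*ln(SSE/n), up to an additive constant."""
    n = len(residuals)
    sse = np.sum(residuals ** 2)
    return 2 * n_parameters + n * np.log(sse / n)

def count_parameters(net):
    """Total number of weights and biases in a fitted scikit-learn MLP."""
    return sum(w.size for w in net.coefs_) + sum(b.size for b in net.intercepts_)
```

A lower value across candidate topologies suggests a better trade-off between error and weight count.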
There are some types of preprocessing that can be useful. Center and scale. Rotate using principal components. If you look at the eigenvalues of your principal components you can use scree-plot rules to estimate the dimension of your data. Reducing the dimension can improve convergence. If you know something about the underlying physics then you can smooth or filter the data to remove noise. Sometimes convergence is about noise in the system.
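A brief sketch of those preprocessing steps with scikit-learn: center and scale, rotate with PCA, and inspect the explained variance (the scree) to guess an effective dimension. The 95% cutoff and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 12))                  # placeholder data

X_scaled = StandardScaler().fit_transform(X)     # center and scale
pca = PCA().fit(X_scaled)                        # rotate onto principal components

print(pca.explained_variance_ratio_)             # the scree: look for an elbow
k = np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95) + 1
X_reduced = PCA(n_components=k).fit_transform(X_scaled)   # reduced-dimension inputs
```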
I find the idea of compressed sensing to be interesting. It can allow radical sub-sampling of some systems without loss of generalization. I would look at some bootstrap re-sampled statistics and distributions of your data to determine whether, and at what level of sub-sampling, the training set becomes representative. This gives you some measure of the "health" of your data.
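One way to probe that, sketched below: bootstrap-resample the data at several sample sizes and watch how the spread of a summary statistic (here the per-column mean) shrinks. Once that spread is negligible for your purposes, a subsample of that size is arguably representative. The statistic, sizes, and data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 8))                   # stand-in for your training data

def bootstrap_spread(sample_size, n_boot=200):
    """Std. dev. of the per-column mean across bootstrap resamples of a given size."""
    means = np.array([
        X[rng.choice(len(X), size=sample_size, replace=True)].mean(axis=0)
        for _ in range(n_boot)
    ])
    return means.std(axis=0)

for size in (50, 200, 1000, 5000):
    print(size, bootstrap_spread(size).max())    # watch how the spread shrinks with size
```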
Sometimes it is a good thing that they do not converge.
Have you ever heard of a voting paradox? You might think of it as a higher-count cousin to a two-way impasse. In the two-person case the first person wants candidate "A" while the second wants candidate "B" (or not-A, or some such); with more voters the preferences can chase each other around in a cycle. The important part is that you can think of it as a loop.
Loops are important in neural networks. Feedback. Recursion. Adding interacting layers of nodes is what took networks past the single perceptron's inability to resolve XOR-like problems. Those interactions make loops, and sometimes the loops can act like the voting paradox, where they would keep changing weights even if you had infinite iterations. They aren't meant to converge, because it isn't the individual weight that matters but the interaction of the weights in the loop.
Note:
Using only 500 iterations can be a problem. I have had NNs where 10,000 iterations were barely enough. The number of iterations that is "enough" depends, as I have already indicated, on the data, the NN topology, the node transfer functions, the learning/training function, and even the computer hardware. You have to have a good understanding of how they all interact with your iteration count before saying that there have been "enough" or "too many" iterations. Time, budget, and what you want to do with the NN when you are done training it should also be considered.
Chen, R. B., Chang, S. P., Wang, W., & Wong, W. K. (2011, September). Optimal Experimental Designs via Particle Swarm Optimization Methods (preprint). Retrieved March 25, 2012, from http://www.math.ntu.edu.tw/~mathlib/preprint/2011-03.pdf
What you describe is indeed one standard way of quantifying the importance of neural-net inputs. Note, however, that for this to work the input variables must be normalized in some way; otherwise the weights corresponding to input variables that tend to have larger values will be proportionally smaller. There are different normalization schemes, such as subtracting off a variable's mean and dividing by its standard deviation. If the variables weren't normalized in the first place, you can instead apply a correction to the weights in the importance calculation itself, such as multiplying by the standard deviation of the variable:
$I_i = \sigma_i\sum\limits_{j = 1}^{n_\text{hidden}}\left|w_{ij}\right|$.
Here $\sigma_i$ is the standard deviation of the $i$th input, $I_i$ is the $i$th input's importance, $w_{ij}$ is the weight connecting the $i$th input to the $j$th hidden node in the first layer, and $n_\text{hidden}$ is the number of hidden nodes in the first layer.
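For concreteness, here is a minimal sketch of that formula for a fitted scikit-learn MLP, where `net.coefs_[0]` holds the first-layer weight matrix; the function name and the use of scikit-learn are my assumptions.

```python
import numpy as np

def weight_importance(net, X):
    """I_i = sigma_i * sum_j |w_ij| over the first-layer weights."""
    sigma = X.std(axis=0)            # per-input standard deviation
    W = net.coefs_[0]                # shape: (n_inputs, n_hidden)
    return sigma * np.abs(W).sum(axis=1)
```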
Another technique is to use the derivative of the neural-net mapping with respect to the input in question, averaged over inputs.
$I_i = \sigma_i\left\langle\left|\frac{dy}{dx_i}\right|\right\rangle$
Here $x_i$ is the $i$th input, $y$ is the output, and the expectation value is taken with respect to the vector of inputs $\mathbf{x}$.
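And a sketch of the derivative-based version, approximating $\left|\frac{dy}{dx_i}\right|$ with a central finite difference at each observed input vector and then averaging; the step size `eps` and the fitted model `net` are assumptions.

```python
import numpy as np

def gradient_importance(net, X, eps=1e-4):
    """I_i = sigma_i * mean(|dy/dx_i|), with the derivative estimated numerically."""
    sigma = X.std(axis=0)
    importances = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        X_plus, X_minus = X.copy(), X.copy()
        X_plus[:, i] += eps
        X_minus[:, i] -= eps
        dy_dxi = (net.predict(X_plus) - net.predict(X_minus)) / (2 * eps)
        importances[i] = sigma[i] * np.mean(np.abs(dy_dxi))
    return importances
```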