I would like to know how one goes about implementing softmax in a neural network. I know that softmax is the exponential divided by the sum of exponentials over the whole output vector, applied at the output layer. Does this mean I apply the softmax function to the vector after the processing in the hidden layer? If yes, what does this softmax do? Isn't it just like multiplying the vector by a scale factor?
Solved – How to implement softmax in a neural network
neural-networks, softmax
Related Solutions
There are a number of questions to ask:
- do you have the appropriate number of neurons in each layer?
- are you using the appropriate types of transfer functions?
- are you using the appropriate type of learning algorithm?
- do you have a large enough sample size?
- can you confirm that your samples have the right sorts of relationship with each other to be informative? (not redundant, of relevant dimension, etc...)
What can you give in the way of background? Can you tell us something about the nature of the data?
You could make a gradient boosted tree of neural networks.
You asked what happens if you stop early.
You can try this yourself. Run 300 trials where you start with randomly initialized weights and stop at a specified number of iterations, let's say 100. At that point compute your ensemble error, your training-subset error, and your test-set error. Repeat. After you have 300 values telling you what the error is, you have an idea of your error distribution given 100 learning iterations. If you like, you can then sample that distribution at several other iteration budgets; I suggest 200, 500, and 1000 iterations. This will give you an idea of how your SNR changes with iteration count. A plot of SNR vs. iteration count can reveal "cliffs" where the error collapses, or show you that the error is already "good enough" at some point.
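A minimal numpy sketch of this protocol, using a tiny one-hidden-layer network and synthetic regression data as hypothetical stand-ins for your actual setup, and 30 restarts instead of the suggested 300 to keep it quick:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed stand-in for the real problem).
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = np.tanh(X @ true_w) + 0.1 * rng.normal(size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

def train(n_iter, seed, lr=0.05, hidden=8):
    """One-hidden-layer tanh net trained by plain gradient descent,
    stopped after exactly n_iter iterations; returns test-set MSE."""
    r = np.random.default_rng(seed)
    W1 = r.normal(scale=0.5, size=(X.shape[1], hidden))
    w2 = r.normal(scale=0.5, size=hidden)
    for _ in range(n_iter):
        H = np.tanh(X_train @ W1)                  # hidden activations
        err = H @ w2 - y_train                     # residuals
        grad_w2 = H.T @ err / len(err)
        grad_H = np.outer(err, w2) * (1 - H**2)    # backprop through tanh
        grad_W1 = X_train.T @ grad_H / len(err)
        W1 -= lr * grad_W1
        w2 -= lr * grad_w2
    return np.mean((np.tanh(X_test @ W1) @ w2 - y_test) ** 2)

# Error distribution at each iteration budget, over random restarts.
budgets = [100, 200, 500]
dist = {n: [train(n, seed) for seed in range(30)] for n in budgets}
for n in budgets:
    e = np.array(dist[n])
    print(n, e.mean(), e.std())
```

Plotting the mean and spread against the budget is the SNR-vs-iteration picture described above.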
It takes "relatively simple" data or "pretty good" luck for your system to consistently converge in under 100 iterations, and neither of those is repeatable or generalizable.
Why are you thinking in terms of the weights converging rather than the error falling below a particular threshold? Have you ever heard of a voting paradox? (link) When you have cyclic interactions in your system (like feedback in neural networks), you can get voting paradoxes - coupled changes. I don't know whether the weights alone are a sufficient indicator of convergence of the network.
You can think of the weights as a space. It has more than 3 dimensions, but it is still a space. In the "centroid" of that space is your "best fit" region. Far from the centroid is a less good fit. You can think of the current setting of your weights as a single point in that space.
Now you don't know where the "good" actually is. What you do have is a local "slope". You can perform gradient descent toward local "better" given where your point is right now. It doesn't tell you the "universal" better, but local is better than nothing.
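The weight-space picture can be sketched numerically. Here the "space" is a hypothetical 4-dimensional quadratic bowl standing in for a real (non-convex) error surface; the loop only ever uses the local slope, yet it walks toward the bowl's minimum:

```python
import numpy as np

# Toy error surface: a quadratic bowl in a 4-D weight space.
# (Hypothetical stand-in; a real network's surface is non-convex.)
A = np.diag([1.0, 2.0, 5.0, 10.0])
b = np.array([1.0, -2.0, 0.5, 3.0])

def error(w):
    return 0.5 * w @ A @ w - b @ w

def grad(w):
    return A @ w - b          # the local "slope" at the current point

w = np.zeros(4)               # current point in weight space
for step in range(200):
    w -= 0.05 * grad(w)       # walk downhill toward local "better"

w_best = np.linalg.solve(A, b)    # the bowl's true minimum
print(np.round(w, 3), np.round(w_best, 3))
```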
So you start iterating, walking downhill toward that valley of betterness. You iterate until you think you are done. Maybe the values of your weights are large. Maybe they are bouncing all over the place. Maybe the computation is taking too long. You want to be done.
So how do you know whether where you are is "good enough"?
Here is a quick test that you could do:
Take 30 uniform random subsets of the data (a few percent of the data each) and retrain the network on each of them. It should be much faster. Observe how long they take to converge and compare that with the convergence history of the big set. Then test each subset-trained network's error on the entire data set and see how that distribution of errors compares to your big-set error. Now bump the subset sizes up to maybe 5% of your data and repeat. See what this teaches you.
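A sketch of that subset test, using a cheap least-squares fit on synthetic data as a stand-in for "retrain the network" (swap in your actual training routine and data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data set; replace with your own X, y.
X = rng.normal(size=(2000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=2000)

def fit(Xs, ys):
    # Cheap surrogate for "retrain the network": least-squares fit.
    w, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return w

full_w = fit(X, y)
full_err = np.mean((X @ full_w - y) ** 2)

# 30 uniform random subsets of ~2% of the data each.
frac = 0.02
sub_errs = []
for _ in range(30):
    idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    w = fit(X[idx], y[idx])
    sub_errs.append(np.mean((X @ w - y) ** 2))  # error on ALL the data

sub_errs = np.array(sub_errs)
print(full_err, sub_errs.mean(), sub_errs.std())
```

If the subset-trained errors cluster close to the full-data error, the small samples are already fairly representative; a wide spread tells you 2% is too little.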
This is a variation on particle swarm optimization (see reference) modeled on how honeybees make decisions based on scouting.
You asked what happens if weights do not converge.
Neural Networks are one tool. They are not the only tool. There are others. I would look at using one of them.
I work in terms of information criteria, so I look at both the weights (parameter count) and the error. You might try one of those.
There are some types of preprocessing that can be useful: center and scale; rotate using principal components. If you look at the eigenvalues of your principal components you can use scree-plot rules to estimate the dimension of your data. Reducing the dimension can improve convergence. If you know something about the underlying physics, you can smooth or filter the data to remove noise. Sometimes convergence is about noise in the system.
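The center-and-scale plus principal-components rotation might look like this in numpy, on hypothetical data that lives mostly in a low-dimensional subspace; the 5% explained-variance cutoff is just one illustrative scree-type rule, not a universal threshold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data living mostly in a 2-D subspace of a 6-D space.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 6)) + 10.0

# Center and scale.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)

# Rotate using principal components (eigenvectors of the covariance).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order

# Scree inspection: fraction of variance each component explains.
explained = eigvals / eigvals.sum()
print(np.round(explained, 3))

# Keep components above, say, 5% explained variance (illustrative rule).
k = int((explained > 0.05).sum())
X_reduced = Xc @ eigvecs[:, :k]
print(k, X_reduced.shape)
```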
I find the idea of Compressed sensing to be interesting. It can allow radical sub-sampling of some systems without loss of generalization. I would look at some bootstrap re-sampled statistics and distributions of your data to determine if and at what level of sub-sampling the training set becomes representative. This gives you some measure of the "health" of your data.
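One way to probe at what sub-sampling level a training set becomes representative is a bootstrap-style resampling of a simple statistic (the mean here; stand-in data, and the subset fractions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in data; replace with your actual training set.
data = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
full_mean = data.mean()

spreads = {}
for frac in (0.01, 0.05, 0.20):
    n = int(frac * len(data))
    # How much does the statistic wobble across subsets of this size?
    means = np.array([rng.choice(data, size=n, replace=False).mean()
                      for _ in range(200)])
    spreads[frac] = means.std()
    print(frac, n, round(full_mean, 3), round(means.mean(), 3),
          round(spreads[frac], 3))
```

When the spread at some fraction is already small relative to the differences you care about, sub-sampling at that level is "healthy" in the sense described above.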
Sometimes it is a good thing that they do not converge
Have you ever heard of a voting paradox? You might think of it as a higher-count cousin to a two-way impasse. It is a loop. In a 2-person voting paradox the first person wants candidate "A" while the second wants candidate "B" (or not-A or such). The important part is that you can think of it as a loop.
Loops are important in neural networks: feedback, recursion. Loops are part of what lets a network move beyond the simple perceptron and resolve XOR-like problems. And sometimes the loops can act like the voting paradox, where they would keep changing weights even if you had infinite iterations. They aren't meant to converge, because it isn't the individual weight that matters but the interaction of the weights in the loop.
Note:
Using only 500 iterations can be a problem. I have had NNs where 10,000 iterations was barely enough. The number of iterations that is "enough" depends, as I have already indicated, on the data, the NN topology, the node transfer functions, the learning/training function, and even the computer hardware. You have to have a good understanding of how they all interact with your iteration count before saying that there have been "enough" or "too many" iterations. Other considerations, like time, budget, and what you want to do with the NN when you are done training it, should also factor in.
Chen, R. B., Chang, S. P., Wang, W., & Wong, W. K. (2011, September). Optimal Experimental Designs via Particle Swarm Optimization Methods (preprint). Retrieved March 25, 2012, from http://www.math.ntu.edu.tw/~mathlib/preprint/2011-03.pdf
You should not use a non-linearity for the last layer before the softmax classification. The ReLU non-linearity (used now almost exclusively) will in this case simply throw away information without adding any additional benefit. You can look at the caffe implementation of the well-known AlexNet for a reference of what's done in practice.
Best Answer
Softmax is applied to the output layer, and its application introduces a non-linear activation. It is not strictly necessary - for instance, the logits (preactivations, $z_j =\mathbf w_j^\top \mathbf x$) could be used directly to reach a classification decision by picking the largest one.
What is the point, then? From an interpretive standpoint, softmax yields positive values that add up to one, normalizing the output so it can be read as a probability mass function over the classes. And it is not just a rescaling: the exponentiation spreads the values of the output layer non-linearly, amplifying the largest logit relative to the rest.
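A minimal implementation illustrates both points: the output is a proper probability vector, and the transform is non-linear, not a simple scale factor. (Subtracting the max before exponentiating is a standard numerical-stability trick; it does not change the result.)

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting the max before
    exponentiating avoids overflow without changing the output."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
print(p, p.sum())
```

Note that adding a constant to every logit leaves the output unchanged, which is exactly why the max-subtraction trick is safe.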
Softmax has a nice derivative with respect to the preactivation values (logits) of the output layer: $\small{\frac{\partial\, \sigma(j)}{\partial z_i}}=\sigma(j)\left(\delta_{ij}-\sigma(i)\right)$.
Further, the natural cost function for softmax is the negative log likelihood (cross-entropy), $\small C =-\displaystyle \sum_k \delta_{kt} \log \sigma(k)= -\log \sigma(t)$, where $t$ is the index of the target class. Its derivative with respect to the activated output values is $\frac{\partial \,C}{\partial\,\sigma(i)}=-\frac{\delta_{it}}{\sigma(t)}$:
providing a very steep gradient in cost when the activated output for the target class is far from $1$. This gradient, which drives the weight adjustments throughout the training phase, would be far weaker if we paired the outputs with a mean squared error cost function instead.
Combining these two derivatives and applying the chain rule
$$\small \frac{\partial C}{\partial z_i}=\frac{\partial C}{\partial(\mathbf{w}_i^\top \mathbf x)}=\sum_k \frac{\partial C}{\partial \sigma(k)}\frac{\partial \sigma(k)}{\partial z_i}$$
...results in a very simple and practical derivative, $\frac{\partial}{\partial z_i}\left(-\log \sigma(t)\right) =\sigma(i) - \delta_{it}$, used in backpropagation during training. This derivative is never more than $1$ or less than $-1$, and it shrinks as the activated output approaches the right answer.
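The final identity $\sigma(i) - \delta_{it}$ is easy to verify with a finite-difference sketch (the five logits and the target class here are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, t):
    # Negative log likelihood of target class t under softmax(z).
    return -np.log(softmax(z)[t])

rng = np.random.default_rng(0)
z = rng.normal(size=5)   # arbitrary logits
t = 2                    # arbitrary target class

# Analytic gradient from the derivation: dC/dz_i = sigma(i) - delta_{it}
analytic = softmax(z).copy()
analytic[t] -= 1.0

# Central finite-difference check of the same gradient.
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(5)[i], t)
     - cross_entropy(z - eps * np.eye(5)[i], t)) / (2 * eps)
    for i in range(5)
])
print(np.max(np.abs(analytic - numeric)))
```

The components of the analytic gradient also sum to zero, since the softmax outputs sum to one.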
References:
- The softmax output function, Neural Networks for Machine Learning, by Geoffrey Hinton
- Peter's notes
- Coursera NN course by Geoffrey Hinton - assignment exercise
- Neural networks [2.2] and [2.3]: Training neural networks - loss function, by Hugo Larochelle
- Why You Should Use Cross-Entropy Error Instead Of Classification Error Or Mean Squared Error For Neural Network Classifier Training, by J.D. McCaffrey