Here are my thoughts on what could be going wrong:
Accuracy (what is being measured)
Perhaps your network is in fact doing well.
Let's consider binary classification. With a 50-50 distribution of labels, 50% accuracy means the model is no better than chance (flipping a coin). If the labels are split 80%-20% and the accuracy is 50%, then the model is worse than chance: always predicting the majority class would already score 80%.
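To make the "chance" reference point concrete, here is a minimal sketch that computes two baselines from the label distribution alone. The helper name `baseline_accuracies` and the toy label lists are my own, not from your code.

```python
from collections import Counter


def baseline_accuracies(labels):
    """Return (majority baseline, stratified-guess baseline) for a list of labels."""
    counts = Counter(labels)
    n = len(labels)
    priors = [c / n for c in counts.values()]
    # Accuracy of always predicting the most frequent class.
    majority = max(priors)
    # Expected accuracy of guessing at random with the data's own class frequencies.
    stratified = sum(p * p for p in priors)
    return majority, stratified


balanced = [0] * 50 + [1] * 50   # 50-50 labels
skewed = [0] * 80 + [1] * 20     # 80-20 labels

print(baseline_accuracies(balanced))  # both baselines are 0.5
print(baseline_accuracies(skewed))    # majority is 0.8, stratified guess is about 0.68
```

Any reported accuracy is only meaningful relative to these numbers.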
You wrote: "No matter what I try, I'm not seeing better than 20% accuracy when I add a hidden layer."
If the accuracy is 20%, just negate the output and you have 80% accuracy, well done! (Well, at least in the binary case.)
Not so fast!
I believe that in your case the accuracy is misleading.
This is a good read on the matter.
For classification, the AUC (area under the curve) is often used.
It's common to also examine the Receiver operating characteristic (ROC) and the confusion matrix.
For the multi-class case this becomes trickier. Here is an answer that I found. Ultimately, this involves a strategy of 1-vs-rest or 1-vs-1 pairs; more on that here.
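As a rough sketch of the metrics mentioned above, the snippet below builds a confusion matrix and a 1-vs-rest AUC averaged over classes, using only the standard library. The function names, labels and probabilities are made up for illustration; in practice a library such as scikit-learn would normally do this for you.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def roc_auc(y_true, scores):
    """Binary AUC: chance that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(labels, probs):
    """Treat each class in turn as 'positive' and all others as 'negative'."""
    n_classes = len(probs[0])
    per_class = []
    for k in range(n_classes):
        binary = [1 if y == k else 0 for y in labels]
        per_class.append(roc_auc(binary, [row[k] for row in probs]))
    return sum(per_class) / n_classes, per_class


# Toy data (hypothetical): 4 samples, 3 classes, softmax-like outputs.
labels = [0, 1, 2, 2]
probs = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]]
predicted = [max(range(3), key=lambda k: row[k]) for row in probs]

cm = confusion_matrix(labels, predicted, 3)
macro_auc, per_class_auc = one_vs_rest_auc(labels, probs)
print(cm)             # [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
print(per_class_auc)  # [1.0, 1.0, 0.875]
```

The confusion matrix shows which classes get confused with which, something a single accuracy figure hides.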
Pre-processing
Are the features scaled? Do they have the same bounds, e.g. [0, 1]?
Have you tried standardizing the features? This gives each feature zero mean and unit variance (it does not by itself make the feature normally distributed).
Perhaps normalization might help? Dividing each input vector by its norm places it on the unit circle (for the L2 norm) and also bounds the features. Scaling should be performed first, though; otherwise the features with larger magnitudes will dominate.
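A minimal sketch of the order I have in mind, standardize each feature first and only then normalize each sample, could look like this. The function names and the tiny two-feature data set are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each feature (column) to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


def l2_normalize(row):
    """Divide one sample (row) by its Euclidean norm so it has unit length."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)


# Two features on very different scales (made-up numbers).
rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]

scaled = standardize(rows)
unit_rows = [l2_normalize(r) for r in scaled]

# Normalizing the raw rows instead lets the second feature swamp the first.
raw_unit = l2_normalize(rows[0])
print(raw_unit)
```

With the raw data, the first feature ends up close to 0 after normalization purely because of its scale.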
Training
As to the learning rate and momentum, if you're not in a big hurry, I would just set a low learning rate and the algorithm will converge more reliably (although more slowly). This is valid for stochastic gradient descent, where examples are shown in random order (are you shuffling the data?).
From your code I can't tell whether this happens.
Are you going one pass only through the training data? For SGD, multiple iterations are made. Perhaps try smaller batches? Have you tried weight decay as a regularization method?
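To illustrate what I mean by shuffling, multiple passes, small batches and weight decay, here is a sketch of mini-batch SGD on a toy logistic regression. This is not your network; the data, the hyperparameters and the function names are all assumptions chosen to keep the example small.

```python
import math
import random


def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(X, y, w, b):
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for x, t in zip(X, y):
        p = min(max(predict(w, b, x), eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(X)


def sgd(X, y, epochs=50, batch_size=4, lr=0.1, weight_decay=1e-3, seed=0):
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    history = []
    for _ in range(epochs):                 # several passes over the data
        rng.shuffle(order)                  # new random order on every pass
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * len(w), 0.0
            for i in batch:
                err = predict(w, b, X[i]) - y[i]
                grad_w = [g + err * xj for g, xj in zip(grad_w, X[i])]
                grad_b += err
            # L2 weight decay adds weight_decay * w to the weight gradient.
            w = [wj - lr * (g / len(batch) + weight_decay * wj)
                 for wj, g in zip(w, grad_w)]
            b -= lr * grad_b / len(batch)
        history.append(log_loss(X, y, w, b))  # loss after each epoch
    return w, b, history


X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b, history = sgd(X, y)
print(history[0], history[-1])
```

The per-epoch `history` is also the curve worth plotting when checking whether training makes progress.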
Architecture
Cross-entropy as loss function: check.
Softmax at outputs: check.
Might be a long shot at this point, but have you tried projecting to a higher dimension in the first hidden layer and then collapsing to a lower-dimensional space in the next one or two hidden layers?
There is also the cost in your output; I wonder if it could be scaled to make more sense. I would try to plot the evolution of the cost (log loss here) and see whether it fluctuates and how steep it is. Your network might be stuck in a local minimum or on a plateau. Or it might be doing very well, in which case double-check the metric.
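Short of plotting, a crude numeric check can already say a lot. The sketch below compares the recent loss with the earlier loss and with the cross-entropy of a uniform guess, which is ln(K) for K classes. The helper names, the window size, the loss values and K = 10 are all hypothetical.

```python
import math


def uniform_guess_loss(n_classes):
    """Cross-entropy (natural log) of predicting 1/n_classes for every class."""
    return math.log(n_classes)


def loss_trend(history, window=5):
    """Relative improvement between the last two windows, and the recent spread."""
    if len(history) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    previous = history[-2 * window:-window]
    recent = history[-window:]
    prev_mean = sum(previous) / window
    recent_mean = sum(recent) / window
    improvement = (prev_mean - recent_mean) / abs(prev_mean)
    spread = max(recent) - min(recent)
    return improvement, spread


# Made-up loss curves: one stuck near the uniform-guess level, one still learning.
stuck = [2.31, 2.30, 2.30, 2.31, 2.30, 2.30, 2.31, 2.30, 2.30, 2.30]
learning = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

print(uniform_guess_loss(10))  # about 2.30 for 10 classes
print(loss_trend(stuck))
print(loss_trend(learning))
```

A loss that hovers around ln(K) with near-zero improvement suggests the network is outputting roughly uniform probabilities and has not learned anything yet.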
Hope this helped or generated some new ideas.
EDIT:
Example of how normalization (L2) can make things worse when features are not scaled relative to the other features. Plots for one sample:
In the left image the blue line is a vector of 10 values generated randomly with a mean of zero and a standard deviation of 1. In the right image I added an 'outlier', or out-of-scale feature: I set the value of feature no. 6 to 10, which is clearly out of scale. When we normalize the out-of-scale vector, all other features become very close to 0, as can be seen in the orange line on the right.
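The effect described above can be reproduced numerically along these lines. This is a sketch, not the code behind the plots; the seed and variable names are my own.

```python
import math
import random

rng = random.Random(0)  # fixed seed, chosen arbitrarily
vector = [rng.gauss(0.0, 1.0) for _ in range(10)]  # mean 0, std 1

with_outlier = list(vector)
with_outlier[5] = 10.0  # feature no. 6 (index 5) is now out of scale

norm_clean = math.sqrt(sum(x * x for x in vector))
norm_outlier = math.sqrt(sum(x * x for x in with_outlier))

normalized_clean = [x / norm_clean for x in vector]
normalized_outlier = [x / norm_outlier for x in with_outlier]

for i, (a, b) in enumerate(zip(normalized_clean, normalized_outlier), start=1):
    print(f"feature {i:2d}: without outlier {a:+.3f}   with outlier {b:+.3f}")
```

Because the norm is now at least 10, every other feature is shrunk by a factor of 10 or more, while the outlier itself ends up close to 1.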
Standardizing the data might be a good thing to do before anything else in this case. Try plotting some histograms of the features or box plots.
You mentioned you are normalizing the vectors to sum up to 1 and now it works better with 10.
That means you are dividing by the 1-norm = sum(abs(X)) instead of the 2-norm (Euclidean) = sum(abs(X).^2)^(1/2). The L1 normalization generates sparser vectors, look at the figure below, where each axis is one feature, so this is a two dimensional space, however it can be generalized to an arbitrary number of dimensions.
Normalizing effectively places each vector on the edge of either shape. For L1 it will lie somewhere on the diamond; for L2, on the circle. Where a vector lands on an axis, its other feature is zero.
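For a two-feature vector the two normalizations can be compared directly. The helper `normalize` and the example vectors below are only illustrative.

```python
def normalize(v, p):
    """Divide v by its p-norm: p=1 gives sum(abs(x)), p=2 the Euclidean norm."""
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return [x / norm for x in v]


v = [3.0, 4.0]
l1 = normalize(v, 1)  # on the diamond: |x| + |y| = 1
l2 = normalize(v, 2)  # on the circle: x^2 + y^2 = 1

print(l1, sum(abs(x) for x in l1))
print(l2, sum(x * x for x in l2))

# A vector that lies on an axis stays on it; the other feature is zero.
print(normalize([0.0, 2.0], 1))
```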
Best Answer
The purpose of the Rectified Linear Activation Function (or ReLU for short) is to allow the neural network to learn nonlinear dependencies.
Specifically, the way this works is that ReLU will return input directly if the value is greater than 0. If less than 0, then 0.0 is simply returned. The idea is to allow the network to approximate a linear function when necessary, with the flexibility to also account for nonlinearity. This article from Machine Learning Mastery goes into more detail on the same.
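In code, the function described above is a one-liner. The sketch below also shows why it counts as nonlinear even though each half of it is a straight line; the sample inputs are arbitrary.

```python
def relu(x):
    """Return x unchanged if it is positive, otherwise 0.0."""
    return x if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 1.5, 3.0]

# A linear function would satisfy f(a + b) == f(a) + f(b); ReLU does not.
a, b = -1.0, 2.0
print(relu(a + b), relu(a) + relu(b))  # 1.0 versus 2.0
```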
As for whether having an activation function would make much difference to the analysis, much of this depends on the data. Given that ReLUs can have quite large outputs, they have traditionally been regarded as inappropriate for use with LSTMs.
Let’s consider the following example. Suppose an LSTM is being used as a time series tool to forecast weekly fluctuations in hotel cancellations (all values in the time series are positive, as the number of cancellations cannot be negative). The network structure is as follows:
When the predictions are compared with the test data, the following readings are obtained:
Now, suppose that a ReLU activation function is invoked:
We see better performance on MFE and slightly worse performance for RMSE. That said, note the difference between the two graphs:
Predictions without ReLU
Predictions with ReLU
We can see that the predictions with ReLU flatten out the volatility in the time series. While this might result in better performance on some metrics (in this case MFE), it also means that the network is not picking up the right volatility trends in the data, as the activation function is not appropriate for the type of data under analysis. Therefore, superior performance for MFE becomes irrelevant under these circumstances.
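A toy illustration of why a good MFE can hide flat predictions: the numbers below are invented and are not the hotel cancellation data, and I am assuming MFE is defined as the mean of (actual - forecast), so that positive and negative errors cancel.

```python
import math


def mfe(actual, forecast):
    """Mean forecast error (bias): average of actual - forecast."""
    return sum(a - f for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error: penalizes every miss, regardless of sign."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


actual = [10.0, 30.0, 10.0, 30.0]    # volatile series
flat = [20.0, 20.0, 20.0, 20.0]      # ignores the volatility entirely
tracking = [14.0, 34.0, 14.0, 34.0]  # follows the swings, but biased upward

print(mfe(actual, flat), rmse(actual, flat))          # 0.0 10.0
print(mfe(actual, tracking), rmse(actual, tracking))  # -4.0 4.0
```

The flat forecast has a perfect MFE because its errors cancel, yet it captures none of the movement in the series.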
In this regard, one should not use ReLU (or any activation function for that matter) blindly – it may not be appropriate for the data (or model) in question.