Sparse representations are expected in hierarchical models. Possibly, what you are discovering is a problem intrinsic to the hierarchical structure of deep learning models. You will find quite a few scientific papers on "sparse representations", especially in memory research.
I think you would benefit from reading about "receptive fields" in the visual cortex. Not only are there ON and OFF cells in the mammalian brain, but also cells whose receptive fields respond to both ON and OFF. Perhaps the edge/sparsity problem could be circumvented by updating the model to reflect current neuroscience on vision, especially in animal models.
In deep learning there are no hard-and-fast rules for setting the number of layers, the number of hidden units per layer, or even the kind of connections between layers. Anyone who claims otherwise often lacks experience of struggling with modern architectures and relies on insight that is 20 years old, if not more. "Proofs" of my claim:
you have all kinds of different architectures performing increasingly better on ImageNet: VGG, Inception, ResNet, ResNeXt, Xception, DenseNet, etc. Even the simplest empirical rules such as "double the number of channels before pooling" or even just "use pooling after a convolutional layer" aren't valid in general, though replacing pooling with increasing strides in successive layers has its drawbacks.
if the "back-of-the-envelope" rules for choosing the number of hidden layers & neurons from the days of old were still valid today, we wouldn't have so much investment in automated architecture learning - I think that, after Deep RL and GANs, this is currently the single biggest R&D investment in AI companies.
But that's not necessarily bad news: your architecture is getting excellent test set accuracy as it is, so maybe you just don't have to worry about optimal architecture. Accuracy is not the correct metric to look at for classification, but as you have nearly balanced classes, and anyway astonishing classification accuracy, it's not like using another metric will make a big difference.
Since you haven't provided any data or training curves, it's hard to say whether overfitting to the training set is actually happening. However, there are a few checks you could make to clear any doubts. Since your architecture is relatively small ($25\times100+100\times50+50\times25+25=8775$ parameters, if I'm not mistaken), these checks won't take you much time. Every time I mention retraining in the following tips, I mean "retraining from scratch", i.e., starting from freshly initialised weights, not from the current weights.
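The parameter count above can be double-checked with a few lines of Python. This is only a sketch: the layer sizes (25 inputs, hidden layers of 100, 50 and 25 units, 1 output) are my reading of the formula, not something stated explicitly, and `count_parameters` is a hypothetical helper. Note that the figure of 8775 counts weights only; adding one bias per non-input unit would raise it slightly.

```python
def count_parameters(layer_sizes, include_biases=False):
    """Count the weights (and optionally biases) of a fully connected network."""
    # One weight per connection between consecutive layers.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias per unit in every layer except the input layer.
    biases = sum(layer_sizes[1:]) if include_biases else 0
    return weights + biases


# Assumed layout, inferred from the formula: 25 -> 100 -> 50 -> 25 -> 1.
layer_sizes = [25, 100, 50, 25, 1]
n_weights = count_parameters(layer_sizes)
n_total = count_parameters(layer_sizes, include_biases=True)
print(n_weights)  # 8775
print(n_total)    # 8951
```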
create new training/test sets with a different random split, possibly having the same class ratio for train & test as in the whole dataset ($\frac{250}{150}=\frac{5}{3}$) and retrain
use a different estimate of generalization error, e.g., $k$-fold cross-validation instead of a train/validation/test split. This time you will need to train a few (exactly $k$) times, so this check is a bit more time consuming than the others.
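To make the two checks above concrete, here is a minimal sketch of a stratified $k$-fold split in plain Python. The labels are made up to match the 250:150 class ratio, and `stratified_k_fold` is a hypothetical helper written for illustration; the model training inside the loop is left out.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx


# Made-up labels with the 250:150 class ratio.
labels = [0] * 250 + [1] * 150
splits = list(stratified_k_fold(labels, k=5))
for train_idx, test_idx in splits:
    n_minority = sum(labels[i] for i in test_idx)
    # Retrain from scratch on train_idx and evaluate on test_idx here.
    print(len(train_idx), len(test_idx), n_minority)  # 320 80 30 for every fold
```

Calling it with a different `seed` gives a new random split, which covers the first check as well. In practice, scikit-learn's `StratifiedKFold` does the same job.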
verify that all data preprocessing steps (apparently only MinMaxScaler, in your case) have been fit on the training set and then applied to the test set, without refitting them on the test set
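The fit-on-train, apply-to-test pattern looks roughly like this. `MinMaxScalerSketch` is a hypothetical stand-in written for illustration, not the library class, and the numbers are invented.

```python
class MinMaxScalerSketch:
    """Minimal stand-in for a min-max scaler (hypothetical, for illustration only)."""

    def fit(self, rows):
        # Learn the per-column minimum and maximum from the given rows.
        columns = list(zip(*rows))
        self.mins = [min(col) for col in columns]
        self.maxs = [max(col) for col in columns]
        return self

    def transform(self, rows):
        # Rescale with the stored statistics; constant columns map to 0.0.
        return [
            [(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(row, self.mins, self.maxs)]
            for row in rows
        ]


train = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]  # invented data
test = [[2.0, 40.0], [6.0, 15.0]]

scaler = MinMaxScalerSketch().fit(train)  # statistics come from the training set only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # reuse them; do NOT call fit(test)
print(test_scaled)  # [[0.25, 1.5], [1.25, 0.25]]
```

Scaled test values falling outside $[0, 1]$ are expected and harmless; they only mean that the test set contains values beyond the training range.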
you already checked that your problem is actually super-easy: both linear SVM and logistic regression could solve it, so that's an excellent reason to suspect that no overfitting to the test set is occurring. It's just an easy problem. Lucky you :-)
I would have a look at decision regions for a few pairs or triplets of variables, maybe the most influential ones according to linear SVM or logreg. Maybe you'll find out that your problem is (close to) linearly separable, so basically most classifiers will do a great job here.
if you are still worried, shuffle the class labels and retrain. Now the only way for your neural network to get high training set accuracy is to memorize the training set, which will manifest in much longer training time. At the same time, test set accuracy will drop dramatically. If this doesn't happen, there's something seriously wrong somewhere in your code.
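A rough sketch of the label-shuffling check, under loud assumptions: the data are synthetic (two well-separated Gaussian blobs), and a 1-nearest-neighbour classifier stands in for the neural network so that the example needs only the standard library. 1-NN memorizes the training set by construction, so it shows the accuracy pattern but not the longer training time.

```python
import random


def make_data(n_per_class, rng):
    """Two well-separated 2-D Gaussian blobs (synthetic, deliberately easy)."""
    points, labels = [], []
    for label, center in ((0, -3.0), (1, 3.0)):
        for _ in range(n_per_class):
            points.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
            labels.append(label)
    return points, labels


def predict_1nn(train_x, train_y, x):
    """Return the label of the training point closest to x."""
    best = min(range(len(train_x)),
               key=lambda i: (train_x[i][0] - x[0]) ** 2 + (train_x[i][1] - x[1]) ** 2)
    return train_y[best]


def accuracy(train_x, train_y, eval_x, eval_y):
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(eval_x, eval_y))
    return hits / len(eval_y)


rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(100, rng)

# Baseline: true labels, easy problem, so test accuracy should be high.
acc_true = accuracy(train_x, train_y, test_x, test_y)

# Sanity check: shuffle the training labels and "retrain".
shuffled_y = train_y[:]
rng.shuffle(shuffled_y)
acc_shuffled_train = accuracy(train_x, shuffled_y, train_x, shuffled_y)  # pure memorization
acc_shuffled_test = accuracy(train_x, shuffled_y, test_x, test_y)        # should be near chance

print(acc_true, acc_shuffled_train, acc_shuffled_test)
```

If test accuracy with shuffled labels stays close to the baseline instead of falling to chance level, test-set information is leaking into training somewhere.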
Perform the opposite test: initialise weights and train on just two or three data points. This time, train accuracy will immediately go to 100%, but test set accuracy will stay extremely low, no matter how long you train. If this doesn't happen, again you have a serious bug somewhere.
If at this point you're still worried (you shouldn't), get new test points (which you've never seen until now) and test on them.
Another possibility may be to just scrap the NN and use the linear SVM: you gain
- much faster training & inference
- better interpretability
- convex optimization, i.e., no doubts about the number of training epochs
- cross-validation becomes very simple and feasible
- online training becomes fast and simple. Well, actually with such a small neural network you could easily perform online training with the NN too.
Best Answer
I am a total novice to this, but my understanding is the following:
input layer - one neuron per input (feature), these are not typical neurons but simply pass the data through to the next layer
hidden layers - simplest structure is to have one neuron in the hidden layer, but deep networks have many neurons and many hidden layers.
output layer - this is the final layer (it comes after the last hidden layer) and should have as many neurons as there are outputs to the classification problem. For instance:
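A minimal sketch of that three-part structure in plain Python, with made-up sizes (4 features, 5 hidden neurons, 3 classes) and random, untrained weights. The helper names `dense`, `init_layer` and `softmax` are mine, and the `tanh` and softmax activations are just common choices, not something the description above prescribes.

```python
import math
import random


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer with n_out neurons."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
n_features, n_hidden, n_classes = 4, 5, 3  # hypothetical sizes
hidden_w, hidden_b = init_layer(n_features, n_hidden, rng)
out_w, out_b = init_layer(n_hidden, n_classes, rng)

x = [0.2, -1.0, 0.5, 3.0]  # input layer: one value per feature, passed through unchanged
hidden = dense(x, hidden_w, hidden_b, math.tanh)   # hidden layer
scores = dense(hidden, out_w, out_b, lambda z: z)  # output layer: one neuron per class
probs = softmax(scores)
print(len(x), len(hidden), len(probs))  # 4 5 3
```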
Reference: https://machinelearningmastery.com/deep-learning-with-python/