Softmax for MNIST should be able to achieve pretty decent results (>95% accuracy) without any tricks. It can be mini-batch based or just single-sample SGD. For example, a tutorial implementation developed from scratch is available at: https://github.com/2015xli/multilayer-perceptron.
It has two implementations, one with mini-batch SGD and the other with single-sample SGD. The code is in Python, but the logic would be essentially the same written in C++.
Without seeing your code, it is hard to tell exactly where the problem in your implementation comes from (although your description of how softmax works is confusing). The potential fixes are as follows.
Firstly, I suspect you did not implement softmax correctly. For one thing, your definition of the cost function is problematic: it does not say how the index $i$ is chosen, and it does not use the label $y$ at all. The cost function should be:
$$
\mathcal L (D;W,b) = -\frac{1}{|D|}\sum_{x\in D} \sum_{i=0}^{k-1} \delta_{i=y} \ln P(i\mid x)\,,
$$
where $k$ is the number of classes ($k=10$ for MNIST) and $\delta_{i=y}$ is 1 when $i$ equals the label $y$ of sample $x$ and 0 otherwise.
It looks like the $i$ in your definition refers to the label $y$. If that is the case, your definition is correct, but such an unclear definition is highly likely to result in unclear code (i.e., an incorrect implementation).
For example, your weight-update equation seems correct, but it is subtle to get all the indices right, especially for mini-batch computation. I would suggest using single-sample SGD (i.e., a mini-batch size of 1) to try again. It is much simpler, while giving almost the same accuracy as larger batch sizes. With a single sample, the weight update (with learning rate $\eta$) is simply:
$$
w_{nl}^{(t)} = w_{nl}^{(t-1)} + \eta\, x_l\left(\delta_{n=y} - P(n\mid x)\right)
$$
(note the plus sign: gradient descent moves $w_{nl}$ in the direction $\delta_{n=y} - P(n\mid x)$, which is the negative gradient of the loss above).
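To make this concrete, here is a minimal numpy sketch of the loss and the single-sample update (all names like `W`, `b`, `lr` are my own, not taken from your code, and the learning rate is an arbitrary choice):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W, b, x, y, lr=0.1):
    """One single-sample SGD step for softmax regression.

    W: (10, 784) weights, b: (10,) biases,
    x: (784,) input pixels, y: integer label in 0..9.
    """
    p = softmax(W @ x + b)        # P(i|x) for all classes i
    loss = -np.log(p[y])          # cross-entropy for this sample
    d = p.copy()
    d[y] -= 1.0                   # d_n = P(n|x) - delta_{n=y}
    W -= lr * np.outer(d, x)      # i.e., w_nl += lr * x_l * (delta_{n=y} - P(n|x))
    b -= lr * d
    return loss
```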
Secondly, you could try normalizing the input by dividing $x$ by 255, and initializing the weights to small random values. That can help avoid the "collapse" you see, although it would not be the root cause of it.
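For example (again only a sketch; the scaling constant 255 comes from the pixel range, while the 0.01 factor is just a common choice):

```python
import numpy as np

images = np.random.randint(0, 256, size=(60000, 784))  # stand-in for raw MNIST pixels
X = images.astype(np.float32) / 255.0                  # scale [0, 255] down to [0, 1]
W = 0.01 * np.random.randn(10, 784)                    # small random initial weights
b = np.zeros(10)                                       # biases can simply start at zero
```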
Finally, assuming you have everything implemented correctly, you could try adding more layers to improve accuracy, since they introduce more capacity and non-linearity. The tutorial code above uses two layers: layer one is a 784$\rightarrow$785 fully connected layer with a sigmoid activation for non-linearity, and layer two is a 785$\rightarrow$10 softmax classifier. (The hidden-layer size and activation function are design choices and can vary.)
That being said, a single softmax layer should still give you about 85% accuracy, and the training should not collapse. You can try this with the example code by removing the hidden layer.
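If you do keep the hidden layer, the forward pass of the two-layer network described above might look like the following sketch (the hidden size of 785 just follows the description above; the names are mine, not the tutorial's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Two-layer MLP: 784 -> 785 sigmoid hidden layer, 785 -> 10 softmax output."""
    h = sigmoid(W1 @ x + b1)      # hidden activations, shape (785,)
    z = W2 @ h + b2               # output logits, shape (10,)
    p = np.exp(z - z.max())       # stabilized softmax
    return h, p / p.sum()         # P(i|x) over the 10 classes
```

Backpropagation through the hidden layer then just multiplies the softmax-layer gradient by the sigmoid derivative $h(1-h)$.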
Feel free to let me know if you have more questions, or send me your code for a check-up.
The difference is that the first formula is the derivative of just the (linear) output of a perceptron, while the second is the derivative of the non-linear activation applied to that output.
When stacking perceptron layers (an MLP, multi-layer perceptron), you have to add some non-linearity to the output of each layer; otherwise the whole process is linear (and can be modeled with a single layer).
So the output of the perceptron (or, more precisely, the input to the next layer) becomes:
$$
o = f(w^\top x + b),
$$
where $f$ is the non-linear activation (e.g., a sigmoid). The derivative then picks up the chain-rule factor $f'(w^\top x + b)$, as in your second formula.
If you are not using a non-linear activation (a single linear layer), the output is:
$$
o = w^\top x + b,
$$
and the derivative is as in your first formula.
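You can check the collapse claim numerically; in this sketch (shapes are arbitrary), two stacked linear layers are reproduced exactly by a single layer with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x + b1) + b2         # stacked linear layers, no activation
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)   # equivalent single linear layer
print(np.allclose(two_layers, one_layer))    # True
```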
Best Answer
Just some general advice:

- Try to limit indexing where possible, and use matrix algebra.
- As the dimension of $Y_i$ varies with $i$, it is best not to store $Y$ as a matrix; treat each $i$ separately. Alternatively, you could define $Y$ as one very long vector $Y=(Y_1^T,\dots,Y_m^T)^T$; similarly, $X$ would have $\sum_{i=1}^{m}n_i$ rows and $d$ columns. But then $W$ needs to be redefined as the direct sum $W=\oplus_{i=1}^{m}W_i$, which introduces structural zeros. Too complicated to work with.
- Don't use indices more than once. For example, you use $w_{j,k}$ and also use $j,k$ as summation variables; use $w_{r,s}$ instead.
So I would write your cost function as $$J=\sum_{i=1}^{m}\left[\frac{1}{2}(X_iW_i-Y_i)^T (X_iW_i-Y_i) +\lambda (W_i-\overline{W})^T (W_i-\overline{W})\right],$$ where $\overline{W}=\frac{1}{m}\sum_{i=1}^{m}W_i$. Now, using the chain rule, for an error vector $e$ we have $\frac{\partial\, e^Te}{\partial W_r}= 2\frac{\partial e}{\partial W_r} e$.
So you have $$ \frac{\partial J}{\partial W_r} =X_r^T (X_rW_r-Y_r) +2\lambda \sum_{i=1}^{m} \left(\frac{\partial W_i}{\partial W_r}- \frac{\partial \overline{W} }{\partial W_r}\right)(W_i-\overline{W})$$ $$ =X_r^T (X_rW_r-Y_r) +2\lambda (W_r-\overline{W})\,,$$ since $\frac{\partial W_i}{\partial W_r}=\delta_{ir}I$, $\frac{\partial \overline{W}}{\partial W_r}=\frac{1}{m}I$, and $\sum_{i=1}^{m}(W_i-\overline{W})=0$.
This is not the answer you have.
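A quick finite-difference check (a sketch with arbitrary shapes, treating each $W_i$ as a vector and with a made-up $\lambda$) confirms the closed form above:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, lam = 4, 3, 0.5
n = [5, 6, 7, 8]                                   # n_i rows per group
X = [rng.standard_normal((n[i], d)) for i in range(m)]
Y = [rng.standard_normal(n[i]) for i in range(m)]
W = [rng.standard_normal(d) for _ in range(m)]

def J(W):
    Wbar = sum(W) / m
    return sum(0.5 * (X[i] @ W[i] - Y[i]) @ (X[i] @ W[i] - Y[i])
               + lam * (W[i] - Wbar) @ (W[i] - Wbar) for i in range(m))

r = 2
Wbar = sum(W) / m
analytic = X[r].T @ (X[r] @ W[r] - Y[r]) + 2 * lam * (W[r] - Wbar)

eps, numeric = 1e-6, np.zeros(d)
for k in range(d):
    Wp = [w.copy() for w in W]
    Wp[r][k] += eps                                # perturb one coordinate of W_r
    numeric[k] = (J(Wp) - J(W)) / eps
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```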
Update
One way you can re-express the equations is by setting $X=\oplus_{i=1}^m X_i$ (which has $\sum_{i=1}^m n_i$ rows and $dm$ columns), $w=(W_1^T,\dots, W_m^T)^T$, and $Y=(Y_1^T,\dots, Y_m^T)^T$. We can also re-express the penalty term as $\sum_{i=1}^m(W_i-\overline {W})^T (W_i-\overline {W})=w^Tw-m\overline {W}^T\overline {W} =w^T (I-m^{-1}G^TG) w$, where $G$ is the $d\times md$ matrix that computes the totals of the $W_i$ from $w$: the $k$th row of $G$ has ones in columns $k, d+k, 2d+k, \dots, (m-1)d+k$ and zeros everywhere else (so $Gw=\sum_{i=1}^m W_i=m\overline{W}$). The data-fit term becomes $\frac{1}{2}(Y-Xw)^T (Y-Xw)$. Hence an explicit solution is given by
$$\hat {w}=\left [X^TX +2\lambda (I-m^{-1} G^TG)\right]^{-1} X^TY $$
It will probably be more efficient to implement this by first applying the Woodbury matrix identity: $X^TX+2\lambda I$ is block-diagonal (hence cheap to invert block by block), and the remaining term $-2\lambda m^{-1} G^TG$ has rank $d$, so Woodbury handles the correction.
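Here is a sketch of the explicit solution (assuming scipy is available for `block_diag`; shapes and $\lambda$ are made up):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
m, d, lam = 4, 3, 0.5
Xi = [rng.standard_normal((5 + i, d)) for i in range(m)]   # varying n_i per group
Yi = [rng.standard_normal(5 + i) for i in range(m)]

X = block_diag(*Xi)                  # sum(n_i) x md block-diagonal design matrix
Y = np.concatenate(Yi)               # stacked response vector
G = np.tile(np.eye(d), m)            # d x md: row k has ones in columns k, d+k, ...
P = np.eye(m * d) - (G.T @ G) / m    # penalty matrix I - G'G/m

w_hat = np.linalg.solve(X.T @ X + 2 * lam * P, X.T @ Y)
W_hat = w_hat.reshape(m, d)          # recover the individual W_i
```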