Softmax for MNIST should be able to achieve a pretty decent result (>95% accuracy) without any tricks. It can be mini-batch based or just single-sample SGD. For example, a tutorial implementation developed from scratch is given at: https://github.com/2015xli/multilayer-perceptron.
It has two implementations, one with mini-batch SGD and the other with single-sample SGD. The code is in Python, but the logic would be essentially the same written in C++.
Without seeing your code, it is hard to tell exactly where the problem in your implementation comes from (although your description of how softmax works is confusing). Some potential fixes are as follows.
Firstly, I suspect you did not implement softmax correctly. For one thing, your definition of the cost function is problematic: it does not say how the index $i$ is chosen, and it does not use the label $y$ at all. The cost function should be:
$$
\mathcal L (D;W,b) = -\frac{1}{|D|}\sum_{x\in D} \sum_{i=0}^{k-1} \delta_{i=y} \ln P(i|x)~.
$$
It looks like the $i$ in your definition refers to the label $y$. If that is the case, then your definition is correct. The problem is that such an unclear definition is highly likely to result in unclear code (i.e., an incorrect implementation).
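To make the role of the label concrete, here is a minimal sketch of that cost function in NumPy (the variable names and shapes are my own assumptions, not taken from your code): only the log-probability of the true class $y$ enters the loss for each sample.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; softmax is shift-invariant.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(W, b, X, y):
    """Mean negative log-likelihood over a dataset D = (X, y).

    X: (m, d) inputs, y: (m,) integer labels in {0, ..., k-1},
    W: (d, k) weights, b: (k,) biases.
    """
    loss = 0.0
    for x_i, y_i in zip(X, y):
        p = softmax(x_i @ W + b)
        loss -= np.log(p[y_i])  # only the true-label probability contributes
    return loss / len(X)
```

With all-zero weights the predicted distribution is uniform, so the loss starts at $\ln k$ and should decrease as training proceeds.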
For example, your weight-update equation seems correct, but it is subtle to get all the indices right, especially for mini-batch computation. I would suggest trying single-sample SGD (i.e., a mini-batch size of 1) first. It is much simpler, while giving almost the same accuracy as larger batch sizes. With a single sample, the weight update is simply:
$$
w_{nl}^{(t)} = w_{nl}^{(t-1)} + \eta\, x_l\left(\delta_{n=y} - P(n|x)\right)
$$
where $\eta$ is the learning rate. (Note the sign: the gradient of the loss is $\left(P(n|x) - \delta_{n=y}\right)x_l$, so gradient descent adds $\eta\, x_l\left(\delta_{n=y} - P(n|x)\right)$.)
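A single-sample SGD step implementing that update might look like the sketch below (shapes, names, and the learning-rate default are my assumptions):

```python
import numpy as np

def sgd_step(W, b, x, y, lr=0.1):
    # One single-sample SGD step for softmax regression.
    # W: (d, k) weights, b: (k,) biases, x: (d,) input, y: integer label.
    z = x @ W + b
    z -= z.max()                    # numerical stability
    p = np.exp(z)
    p /= p.sum()                    # p[n] = P(n|x) for each class n
    delta = np.zeros_like(p)
    delta[y] = 1.0                  # one-hot encoding of the label
    # Descent step: W += lr * x_l * (delta_n - P(n|x)), per the update rule.
    W += lr * np.outer(x, delta - p)
    b += lr * (delta - p)
    return W, b
```

Each step should reduce the negative log-likelihood of that sample for a small enough learning rate; if the loss goes up instead, the sign of the update is the first thing to check.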
Secondly, you could try normalizing the input by dividing $x$ by 255, and also initializing the weights to small random values. That can help reduce the chance of a "collapse", although it would not be the root cause of one.
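For instance (the initialization scale and seed here are arbitrary choices, not anything your setup requires):

```python
import numpy as np

def preprocess(X_raw):
    # Scale raw MNIST pixel values from [0, 255] into [0, 1].
    return X_raw.astype(np.float64) / 255.0

def init_weights(d, k, scale=0.01, seed=0):
    # Small random weights break symmetry; biases can start at zero.
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, scale, size=(d, k))
    b = np.zeros(k)
    return W, b
```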
Finally, assuming you have everything implemented correctly, you could try more layers to improve the accuracy, in order to introduce more capacity and non-linearity. The tutorial code example above uses two layers: layer one is a 784$\rightarrow$785 fully connected layer with sigmoid activation for the non-linearity; layer two is a 785$\rightarrow$10 softmax classifier. (The design options for the hidden layer size and activation function can vary.)
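A forward pass through that architecture can be sketched as follows (the layer sizes follow the description above; the initialization scale and variable names are my assumptions, and this is not the tutorial's actual code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Layer one: 784 -> 785 fully connected + sigmoid non-linearity.
    h = sigmoid(x @ W1 + b1)
    # Layer two: 785 -> 10 softmax classification.
    z = h @ W2 + b2
    z -= z.max()            # numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.01, size=(784, 785)); b1 = np.zeros(785)
W2 = rng.normal(0.0, 0.01, size=(785, 10));  b2 = np.zeros(10)
p = forward(rng.random(784), W1, b1, W2, b2)  # a distribution over 10 classes
```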
That being said, a single softmax layer is still able to get you about 85% accuracy, and the training should not collapse. You can try this with the example code by removing the hidden layer.
Feel free to let me know if you have more questions, or you can send me your code for a check.
Best Answer
You are correct up to the second line of your working in the last part, and then you make an error by dropping the requirement that $j=p$ (which means you retain an additional sum that shouldn't be there). Continuing from your last correct step, you should have:
$$\begin{aligned} \frac{\partial \ell}{\partial \Theta_p}(\Theta) &= \sum_{i=1}^m \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) (\delta_{pj}-s_p^{(i)}) x^{(i)} \\[6pt] &= \sum_{i=1}^m x^{(i)} \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) (\delta_{pj}-s_p^{(i)}) \\[6pt] &= \sum_{i=1}^m x^{(i)} \bigg[ \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) \mathbb{I}(p=j) - s_p^{(i)} \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) \bigg] \\[6pt] &= \sum_{i=1}^m x^{(i)} \bigg[ \mathbb{I}(y^{(i)}=p) - s_p^{(i)} \bigg] \\[6pt] &= \sum_{i=1}^m x^{(i)} \mathbb{I}(y^{(i)}=p) - \sum_{i=1}^m s_p^{(i)} x^{(i)}. \end{aligned} $$
Here $s_p^{(i)}$ denotes the softmax output $s_p$ evaluated at $x^{(i)}$; since it depends on $i$, it cannot be pulled outside the sum over $i$.
(The penultimate step follows from the fact that $\sum_{j=1}^k \mathbb{I}(y^{(i)}=j) = 1$ for all $i = 1, ..., m$.)
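One way to convince yourself the derived formula is right is a numerical gradient check: implement $\partial \ell / \partial \Theta_p = \sum_i x^{(i)}\big[\mathbb{I}(y^{(i)}=p) - s_p^{(i)}\big]$ directly and compare it against central finite differences of the log-likelihood. The sketch below uses my own shapes and names ($\Theta$ as a $k \times d$ matrix whose $p$-th row is $\Theta_p$), keeping the per-sample dependence of $s_p$ explicit.

```python
import numpy as np

def log_likelihood(Theta, X, y):
    # ell(Theta) = sum_i log s_{y_i}(x_i), with s = softmax(Theta x).
    Z = X @ Theta.T                     # (m, k) logits
    Z -= Z.max(axis=1, keepdims=True)   # numerical stability
    S = np.exp(Z)
    S /= S.sum(axis=1, keepdims=True)
    return np.sum(np.log(S[np.arange(len(y)), y]))

def grad_row(Theta, X, y, p):
    # The derived formula: sum_i x_i * (I(y_i = p) - s_p(x_i)).
    Z = X @ Theta.T
    Z -= Z.max(axis=1, keepdims=True)
    S = np.exp(Z)
    S /= S.sum(axis=1, keepdims=True)
    return ((y == p).astype(float) - S[:, p]) @ X
```

If the analytic and numerical gradients agree to several decimal places for every row $p$, the derivation (including the sign and the placement of $s_p^{(i)}$ inside the sum) checks out.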