Softmax for MNIST should be able to achieve a pretty decent result (>95% accuracy) without any tricks. It can be mini-batch based or just single-sample SGD. For example, a tutorial implementation developed from scratch is given at: https://github.com/2015xli/multilayer-perceptron.
It has two implementations, one with mini-batch SGD and the other with single-sample SGD. The code is in Python, but it would look essentially the same written in C++.
Without seeing your code, it is hard to tell exactly where the problem in your implementation comes from (although your description of how softmax works is confusing). Some potential fixes are as follows.
Firstly, I suspect you did not implement softmax correctly. For one thing, your definition of the cost function is problematic: it does not say how the index $i$ is chosen, and it does not use the label $y$ at all. The cost function should be:
$$
\mathcal L (D;W,b) = -\frac{1}{|D|}\sum_{x\in D} \sum_{i=0}^{k} \delta_{i=y} \ln P(i|x)~.
$$
It looks like the $i$ in your definition refers to the label $y$. If that is the case, then your definition is correct. The problem is that an unclear definition is very likely to result in unclear code (i.e., an incorrect implementation).
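For illustration, here is a minimal NumPy sketch of that cost function, assuming `X` holds the flattened images as rows and `y` holds the integer labels (the names and shapes are my own illustration, not from the linked repo):

```python
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability; it does not change the result
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cost(X, y, W, b):
    # X: (|D|, 784) inputs, y: (|D|,) integer labels, W: (784, 10), b: (10,)
    P = softmax(X @ W + b)                       # P[m, i] = P(i | x_m)
    # the delta_{i=y} in the sum just picks out the log-probability of the true label
    return -np.mean(np.log(P[np.arange(len(y)), y]))
```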
For example, your weight update equation seems correct, but it is somewhat subtle to get all the indices right, especially for mini-batch computation. I would suggest using single-sample SGD (i.e., a mini-batch size of 1) to try again. It is much simpler, while giving almost the same accuracy as a larger batch size. With a single sample, the weight update (with learning rate $\eta$) is simply:
$$
w_{nl}^{(t)} = w_{nl}^{(t-1)} - \eta\, x_l\left(P(n|x) - \delta_{n=y}\right)
$$
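A single-sample update step consistent with that equation could look like the sketch below (the learning rate `lr` and the variable names are illustrative assumptions, not the tutorial's code):

```python
import numpy as np

def sgd_step(x, y, W, b, lr=0.01):
    # x: (784,) single input, y: integer label in 0..9, W: (784, 10), b: (10,)
    z = x @ W + b
    p = np.exp(z - z.max()); p /= p.sum()   # p[n] = P(n | x)
    grad = p.copy()
    grad[y] -= 1.0                          # grad[n] = P(n | x) - delta_{n=y}
    W -= lr * np.outer(x, grad)             # w_{nl} update: lr * x_l * grad[n]
    b -= lr * grad
    return W, b
```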
Secondly, you could try normalizing the input by dividing $x$ by 255, and initializing the weights to small random values. That can help reduce the chance of the "collapse", although a lack of normalization would not be the real cause of the "collapse".
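For example, something along these lines (the 0.01 scale for the weights is just a common heuristic, not taken from the linked code):

```python
import numpy as np

X_raw = np.random.randint(0, 256, size=(100, 784))  # stand-in for the raw MNIST pixels
X = X_raw.astype(np.float64) / 255.0                 # scale inputs into [0, 1]
W = 0.01 * np.random.randn(784, 10)                  # small random initial weights
b = np.zeros(10)
```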
Finally, assuming you have everything implemented correctly, you could try more layers to improve the accuracy, in order to introduce more capacity and non-linearity. The tutorial code above uses two layers: layer one is a 784$\rightarrow$785 fully connected layer with a sigmoid activation for the non-linearity, and layer two is a 785$\rightarrow$10 softmax classification layer. (The design options for the hidden layer size and activation function can vary.)
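A forward-pass sketch of that kind of two-layer network might look like the following (this is only my own minimal illustration of the architecture described above, not the repo's actual code):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # layer one: fully connected + sigmoid; layer two: fully connected + softmax
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    z = h @ W2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()                        # class probabilities P(i | x)

# example shapes, with a hidden layer of size 785 as in the description above
W1 = 0.01 * np.random.randn(784, 785); b1 = np.zeros(785)
W2 = 0.01 * np.random.randn(785, 10);  b2 = np.zeros(10)
p = forward(np.random.rand(784), W1, b1, W2, b2)
```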
That being said, a single softmax layer should still give you around 85% accuracy, and the training should not collapse. You can try the example code with the hidden layer removed.
Feel free to let me know if you have more questions, or send me your code and I can take a look.
It is actually the opposite of what you said: "The sigmoid binary loss would encourage the true label logits to be much higher than 0, and the other logits to be much smaller than 0. I think everyone agrees that this is intuitive."
When applying a sigmoid to each logit and taking the binary cross-entropy loss, we encourage each output component to be independent of the others: one logit being higher doesn't push the other logits to be smaller. This is exactly the opposite of softmax, which encourages one logit to be higher while all the others become smaller. This is why it's counterintuitive, as indicated in the paper.
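A tiny NumPy illustration of the difference (the logit values here are made up): both losses have per-component gradients of the form $p_n - y_n$, but the sigmoid probability for class $n$ depends only on $z_n$, while the softmax probability depends on all logits through the shared normalizer.

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])   # logits for 3 classes; the true class is 0
y = np.array([1.0, 0.0, 0.0])

# per-logit sigmoid + binary cross-entropy: d loss / d z_n = sigmoid(z_n) - y_n
# each component depends only on its own logit, so the logits stay independent
grad_bce = 1.0 / (1.0 + np.exp(-z)) - y

# softmax + cross-entropy: d loss / d z_n = softmax(z)_n - y_n
# every component depends on ALL logits through the shared normalizer,
# so pushing one logit up necessarily pushes the other probabilities down
p = np.exp(z - z.max()); p /= p.sum()
grad_ce = p - y
```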
Best Answer
Intuition is a funky concept. For an ex-physicist like myself, seeing softmax for the first time was "Ok, this is the Boltzmann distribution." For a statistician it would be "Oh, isn't this mlogit?"
Physicist's intuition
Softmax is literally the case of the canonical ensemble: $$ p_i=\frac 1 Q e^{- {\varepsilon}_i / (k T)}=\frac{e^{- {\varepsilon}_i / (kT)}}{\sum_{j=1}^{n}{e^{- {\varepsilon}_j / (k T)}}}$$ The denominator is called the canonical partition function; it's basically a normalizing constant that makes sure the probabilities add up to 100%. But it has a physical meaning too: the system can only be in one of its $n$ states, which is why the probabilities must add up. This stuff is straight out of statistical mechanics.
The probability of a state $i$ is defined by its energy $\varepsilon_i$ relative to the energies of all other states. You see, in physics systems always try to minimize their energy, so the probability of the state with the lowest energy must be the highest. However, if the temperature of the system $T$ is high, then the difference between the probability of the lowest-energy state and the probabilities of the other states vanishes: $$\lim_{T\to\infty}p_{min}/p_i=\lim_{T\to\infty}e^{ ({\varepsilon_i- \varepsilon}_{min}) / (k T)}=1$$
So, in the OP's equation the energy is $\varepsilon=-z$ and the temperature is $T\sim 1/\lambda$. He also isolates the base state and replaces its exponential with 1. This doesn't change anything for the intuition; it only sets all energies relative to a chosen base state. This is VERY intuitive to a physicist.
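If it helps, here is a tiny numeric illustration of both points, the high-temperature limit and the irrelevance of the energy offset (the energies are made up):

```python
import numpy as np

def boltzmann(energies, kT=1.0):
    # p_i proportional to exp(-eps_i / kT); this is softmax applied to -energies / kT
    w = np.exp(-(energies - energies.min()) / kT)   # shift for numerical stability
    return w / w.sum()

eps = np.array([0.0, 1.0, 3.0])
print(boltzmann(eps, kT=1.0))    # the lowest-energy state dominates
print(boltzmann(eps, kT=100.0))  # high temperature: the probabilities flatten out
print(boltzmann(eps + 5.0))      # same as the first line: only relative energies matter
```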
Statistician's intuition
A statistician will immediately recognize the multinomial logit regression. For those who only know bivariate logit regression, here's how mlogit works.
Estimate $n-1$ bivariate logits of the $n-1$ states vs a chosen base state on censored data sets. That is, you create a dataset from the base state, say 1, and one of the states $i\in[2,n]$. This way you get $n-1$ conditional logits, one for each $i$: $$\ln\frac{Pr[i|i\cup 1]}{Pr[1|i\cup 1]}\sim X_i$$ This equation is more recognizable as: $$\ln\frac{p}{1-p}\sim X_i$$ which is how it is usually presented in the bivariate case, where there are only two categories to choose from, as in our censored subset of the full dataset with $n$ categories.
Using Bayes' theorem we know that: $$Pr[i|i\cup 1]=\frac{Pr[i]}{Pr[i]+Pr[1]}$$ So we can trivially combine the $n-1$ bivariate regressions into a single one to get the unconditional probabilities: $$Pr[i]=\frac{e^{X_i\beta_i}}{1+\sum_{j=2}^{n} e^{X_j\beta_j}}$$ This gets us the OP's equation.
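A quick numeric sanity check of that combination step (the probabilities here are made up):

```python
import numpy as np

# made-up unconditional probabilities over n = 4 categories; category 0 is the base state
p = np.array([0.4, 0.3, 0.2, 0.1])

# the conditional (censored) odds against the base are p_i / p_0,
# i.e. what each bivariate logit would model as exp(X_i beta_i)
odds = p[1:] / p[0]

# combining them as in the last equation recovers the unconditional probabilities
recovered = np.concatenate(([1.0], odds)) / (1.0 + odds.sum())
print(np.allclose(recovered, p))   # True
```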