Softmax for MNIST should be able to achieve a pretty decent result (>95% accuracy) without any tricks. It can be mini-batch based or just single-sample SGD. For example, a tutorial implementation developed from scratch is given at: https://github.com/2015xli/multilayer-perceptron.
It has two implementations, one with mini-batch SGD and the other with single-sample SGD. The code is in Python, but it would be essentially the same written in C++.
Without seeing your code, it is not straightforward to tell where the problem in your implementation comes from (although your description of how softmax works looks confused). The potential fixes are as follows.
Firstly, I guess this happens because you did not implement softmax correctly. For one thing, your definition of the cost function is problematic: it does not say how the index $i$ is chosen, and it does not use the label $y$ at all. The cost function should be:
$$
\mathcal L (D;W,b) = -\frac{1}{|D|}\sum_{x\in D} \sum_{i=0}^{k} \delta_{i=y} \ln P(i|x)~,
$$
where $\delta_{i=y}$ is $1$ when $i=y$ and $0$ otherwise, so only the term for the true label $y$ survives.
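To make this concrete, here is a minimal NumPy sketch of the cost for a single sample (the function names are mine, not from the tutorial repo). Note the subtraction of the maximum logit: a naive `exp` can overflow and silently corrupt training, which is one common way a run "collapses".

```python
import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability; np.exp(z) alone can overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, y):
    # -ln P(y|x): only the i == y term survives the delta in the sum above.
    return -np.log(p[y] + 1e-12)

z = np.array([2.0, 1.0, 0.1])  # logits z = Wx + b for k = 3 classes
p = softmax(z)
print(cross_entropy(p, y=0))   # ~0.417
```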
It looks like the $i$ in your definition refers to the label $y$. If that is the case, then your definition is correct. The problem is that such an unclear definition is highly likely to result in unclear code (i.e., an incorrect implementation).
For example, your weight-update equation seems correct, but it is somewhat subtle to get all the indices right, especially for mini-batch computation. I would suggest you use single-sample SGD (i.e., a mini-batch size of just 1) to try again. It is much simpler, while giving almost the same accuracy as a bigger batch size. With a single sample, the weight update is simply:
$$
w_{nl}^{(t)} = w_{nl}^{(t-1)} + \eta\, x_l\left(\delta_{n=y} - P(n|x)\right),
$$
where $\eta$ is the learning rate. (Note the sign: $\delta_{n=y} - P(n|x)$ is the negative gradient, so it is added to the weight.)
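In code, one single-sample step could look like the following NumPy sketch (the names `sgd_step` and `eta` are mine; the bias update, which the equation above omits, follows the same pattern with $x_l$ replaced by 1):

```python
import numpy as np

def sgd_step(W, b, x, y, eta=0.1):
    # Forward pass: logits z = Wx + b, then a numerically stable softmax.
    z = W @ x + b
    e = np.exp(z - z.max())
    p = e / e.sum()
    # One-hot vector encoding delta_{n=y}.
    delta = np.zeros_like(p)
    delta[y] = 1.0
    # w_nl <- w_nl + eta * x_l * (delta_{n=y} - P(n|x))
    W += eta * np.outer(delta - p, x)
    b += eta * (delta - p)
    return W, b

# Tiny usage example on one fake image.
rng = np.random.default_rng(0)
W, b = np.zeros((10, 784)), np.zeros(10)
W, b = sgd_step(W, b, rng.random(784), y=3)
```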
Secondly, you could try normalizing the input by dividing $x$ by 255, and also initializing the weights to small random values, as sketched below. That can help reduce the chance of a "collapse", though it would not be the root cause of one.
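For example (a sketch; `raw_pixels` stands in for one MNIST image, and the 0.01 scale is just a typical small value, not a tuned one):

```python
import numpy as np

rng = np.random.default_rng(0)
raw_pixels = rng.integers(0, 256, size=784)  # stand-in for one MNIST image

x = raw_pixels.astype(np.float32) / 255.0    # normalize inputs to [0, 1]
W = rng.normal(0.0, 0.01, size=(10, 784))    # small random initial weights
b = np.zeros(10)                             # biases can simply start at zero
```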
Finally, assuming you've implemented everything correctly, you could try more layers to improve the accuracy, in order to introduce more capacity and non-linearity. The tutorial code example above uses two layers: layer one is a 784$\rightarrow$785 fully connected network plus a sigmoid activation for the non-linearity; layer two is a 785$\rightarrow$10 softmax classification. (The design options for the hidden-layer size and activation function can vary.)
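The forward pass of such a two-layer network might look like the sketch below (layer sizes taken from the description above; everything else, including the variable names, is my own choice rather than the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.01, size=(785, 784))  # layer 1: 784 -> 785
b1 = np.zeros(785)
W2 = rng.normal(0.0, 0.01, size=(10, 785))   # layer 2: 785 -> 10
b2 = np.zeros(10)

def forward(x):
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))  # sigmoid hidden activations
    z = W2 @ h + b2
    e = np.exp(z - z.max())                   # stable softmax
    return e / e.sum()                        # P(i|x) for i = 0..9

x = rng.random(784)        # one dummy normalized input
print(forward(x).sum())    # probabilities sum to 1.0
```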
That being said, a single layer of softmax is still able to bring you about 85% accuracy, and the training should not collapse. You can try this with the example code by removing the hidden layer.
Feel free to let me know if you have more questions, or you can send me your code for a check-up.
The spectral radius of a matrix is not always equal to the matrix's norm.
$A=\left(\begin{matrix}0 & 1\\ 0 & 0 \end{matrix}\right)$ is a counterexample: $0$ is the only eigenvalue of $A$, so the spectral radius of $A$ is $0$, while the norm of $A$ is $1$ (its largest singular value).
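This is easy to check numerically (a NumPy sketch; for a matrix argument, `np.linalg.norm(A, 2)` returns the spectral norm, i.e., the largest singular value):

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])

print(np.linalg.eigvals(A))   # [0. 0.] -> the spectral radius is 0
print(np.linalg.norm(A, 2))   # 1.0    -> the matrix 2-norm is 1
```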
On the other hand, the largest singular value of a matrix is always equal to the matrix's norm, as shown here.
The paper assumes that $W_{rec}$ is a square matrix, and thus $||W_{rec}||=||W_{rec}^T||$, as shown here (which uses what is shown here).
$\lambda_1$ is the largest singular value of $W_{rec}$, so $\lambda_1=||W_{rec}||=||W_{rec}^T||$. Since the paper assumes $\lambda_1<\frac{1}{\gamma}$, it follows that $||W_{rec}^T||<\frac{1}{\gamma}$.
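A quick numerical sanity check of these equalities (a sketch; the random square matrix here just plays the role of $W_{rec}$):

```python
import numpy as np

rng = np.random.default_rng(0)
W_rec = rng.normal(size=(5, 5))                       # arbitrary square matrix

lambda_1 = np.linalg.svd(W_rec, compute_uv=False)[0]  # largest singular value
print(lambda_1)                                       # all three prints agree
print(np.linalg.norm(W_rec, 2))                       # ||W_rec||
print(np.linalg.norm(W_rec.T, 2))                     # ||W_rec^T||
```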
I have been looking into the same question, and I have finally deduced the following. I think $\beta$ is a factor that balances the importance of the two terms (the codebook loss and the commitment loss).
If the $\beta$ factor is smaller than 1, it means that the encoder is updated faster than the codebook.
That is interesting if, for example, we think about it from a centroid perspective (the codebook): we do not want the centroids to update strongly in each iteration, because we have to preserve some information from the previous batches (which matters even more when the batch is small).
In short, we want the centroids (the codebook) to move slowly, while the encoder outputs can be updated faster. This technique can probably reduce the noise produced by mini-batch sampling, in contrast to using the whole dataset. A sketch of the loss is given below.
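To make the role of $\beta$ concrete, this is how the two terms from the VQ-VAE paper are typically written (a PyTorch-style sketch; the stop-gradient operator becomes `detach()`, and the variable names are mine):

```python
import torch

def vq_loss(z_e, e, beta=0.25):
    # Codebook loss: z_e is detached, so this term's gradient updates only the codebook.
    codebook_loss = torch.mean((z_e.detach() - e) ** 2)
    # Commitment loss: e is detached, so this term's gradient updates only the encoder.
    commitment_loss = torch.mean((z_e - e.detach()) ** 2)
    # beta scales the commitment term -- the only term whose gradient
    # reaches the encoder -- relative to the codebook term.
    return codebook_loss + beta * commitment_loss

# Tiny usage example.
z_e = torch.randn(8, 64, requires_grad=True)  # encoder outputs
e = torch.randn(8, 64, requires_grad=True)    # selected codebook vectors
print(vq_loss(z_e, e))
```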
This is what I have deduced; if it is not correct, please point it out.