Solved – Gradient descent applied to softmax regression

classification, gradient descent, machine learning, optimization, softmax

I am currently trying to reimplement softmax regression to classify the MNIST handwritten digits. I am not a machine learning expert; my plan was to get an intuition for the entire workflow needed to learn a model. So I wrote a simple C++ program that optimizes the following probabilistic model

$$ P(i|x) = \frac{\exp\left(w_i^T x + b_i\right)}{\sum_{j=0}^{9}\exp\left(w_j^T x + b_j\right)}$$
where $i,j=0,\dots,9$ index the ten digit classes, $w_i,x\in \mathbb{R}^{784}$ and $b_i\in \mathbb{R}$. As the objective function I used
$$ \mathcal L (D;W,b) = -\frac{1}{|D|}\sum_{(i,x)\in D} \ln P(i|x)~, $$
where $D$ is the entire training sample. To minimize $\mathcal L$, I used mini-batch stochastic gradient descent with the derivatives

$$ \frac{\partial}{\partial w_{nl}} \ln P(i|x) = x_l\left(\delta_{i,n} - P(n|x)\right)$$

$$ \frac{\partial}{\partial b_{n}} \ln P(i|x) = \delta_{i,n} - P(n|x)$$
where I introduced the Kronecker delta $\delta_{i,n}$, which is 1 if $i=n$ and 0 otherwise. Here my question begins. The MNIST handwritten digits comprise 70k samples; I used 10k for testing (which I never reached), 6k for validation, and the remaining 54k for training. To train the model I used mini-batch stochastic gradient descent with batch size 200 and the update equations
$$ w_{nl}^{(t)} = w_{nl}^{(t-1)} + \frac{\eta}{|B|} \sum_{(i,x)\in B} x_l\left(\delta_{i,n} - P(n|x)\right)$$
and
$$ b_{n}^{(t)} = b_{n}^{(t-1)} + \frac{\eta}{|B|} \sum_{(i,x)\in B} \left(\delta_{i,n} - P(n|x)\right) $$
where $B$ is the current batch and the learning rate was taken to be 0.1. When I trained the model I achieved an accuracy of 60-70 % before the model somehow collapses, meaning that the probabilities of training samples become zero so that the log-likelihood becomes infinite. I played around a lot with different batch sizes and learning rates, but was not able to figure out my mistake. Moreover, I implemented an adaptive learning rate, as shown in the plots below; however, it did not help to solve the problem.

[Plots: log-likelihood, accuracy, and learning rate during training]

For illustration purposes I generated two plots of the log-likelihood, the accuracy, and the learning rate for a batch size of 200 samples. Note that I did not include the collapse itself in the plots; the next value of the curve would have been infinite.
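To make the setup concrete, here is a minimal NumPy sketch of the batch update described above (my actual implementation is in C++; the names here are only illustrative):

```python
import numpy as np

def softmax(Z):
    # Z: (|B|, 10) logits; subtracting the row-wise max is mathematically
    # identical to the formula for P(i|x) but avoids overflow in exp
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def batch_update(W, b, X, y, eta=0.1):
    """One mini-batch step of the update equations above.
    W: (784, 10), b: (10,), X: (|B|, 784), y: (|B|,) integer labels 0..9."""
    B = X.shape[0]
    P = softmax(X @ W + b)                # P(n|x) for every sample in the batch
    delta = np.zeros_like(P)
    delta[np.arange(B), y] = 1.0          # Kronecker delta as one-hot rows
    W += (eta / B) * (X.T @ (delta - P))  # update of all w_{nl} at once
    b += (eta / B) * (delta - P).sum(axis=0)
    return -np.log(P[np.arange(B), y]).mean()   # batch negative log-likelihood
```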

Best Answer

Softmax regression on MNIST should be able to achieve a pretty decent result (>95% accuracy) without any tricks. It can be trained with mini-batch SGD or with single-sample SGD. For example, a tutorial implementation developed from scratch is available at https://github.com/2015xli/multilayer-perceptron. It has two implementations, one with mini-batch SGD and the other with single-sample SGD. The code is in Python, but it would be essentially the same written in C++.

Without seeing your code, it is not straightforward to tell where the problem in your implementation comes from (although your description of how softmax really works is a bit confusing). The potential fixes are as follows.

Firstly, I guess you did not implement softmax correctly. For one thing, your definition of the cost function is problematic, since it does not say how the index $i$ is chosen, and it does not even use the label $y$ at all. The cost function should be: $$ \mathcal L (D;W,b) = -\frac{1}{|D|}\sum_{x\in D} \sum_{i=0}^{9} \delta_{i=y} \ln P(i|x)~, $$

It looks like the $i$ in your definition refers to the label $y$. If that is the case, then your definition is correct. The problem is that such an unclear definition is highly likely to result in unclear code (i.e., an incorrect implementation).
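To make that explicit, here is a small NumPy sketch of the cost (the names `probs` and `y` are only for illustration; `probs` holds the softmax outputs and `y` the integer labels):

```python
import numpy as np

def nll(probs, y):
    """Mean negative log-likelihood.
    probs: (N, 10) softmax outputs P(i|x); y: (N,) integer labels in 0..9."""
    # the inner sum over i with delta_{i=y} simply picks out P(y|x)
    return -np.mean(np.log(probs[np.arange(len(y)), y]))
```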

For example, your weight update equation seems correct, although it is somewhat subtle to get all the indices right, especially for mini-batch computation. I would suggest using single-sample SGD (i.e., a mini-batch size of 1) to try again. It is much simpler, while giving almost the same accuracy as a bigger batch size. With a single sample, the weight update is simply: $$ w_{nl}^{(t)} = w_{nl}^{(t-1)} + \eta\, x_l\left(\delta_{n=y} - P(n|x)\right) $$
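A sketch of one pass of single-sample SGD under that update (NumPy; the function and variable names are just illustrative):

```python
import numpy as np

def sgd_epoch(W, b, X, y, eta=0.1):
    """Single-sample SGD over one shuffled pass of the training set.
    W: (784, 10), b: (10,), X: (N, 784), y: (N,) integer labels 0..9."""
    for idx in np.random.permutation(len(y)):
        x, label = X[idx], y[idx]
        z = x @ W + b
        z -= z.max()                        # stability; does not change P(n|x)
        p = np.exp(z) / np.exp(z).sum()     # P(n|x) for all classes n
        delta = np.zeros(10)
        delta[label] = 1.0                  # Kronecker delta for the true class
        W += eta * np.outer(x, delta - p)   # update of every w_{nl}
        b += eta * (delta - p)
    return W, b
```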

Secondly, you could try to normalize the input by dividing $x$ by 255, and also initialize the weights to small random values. That can help reduce the chance of a "collapse", although it would not be the real cause of the collapse.
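For example (NumPy sketch; the weight scale 0.01 is just an illustrative choice):

```python
import numpy as np

def prepare(X_raw, seed=0):
    """X_raw: (N, 784) uint8 MNIST pixels.
    Returns scaled inputs and small initial parameters."""
    rng = np.random.default_rng(seed)
    X = X_raw.astype(np.float64) / 255.0          # pixels scaled into [0, 1]
    W = rng.normal(scale=0.01, size=(784, 10))    # small random weights
    b = np.zeros(10)                              # biases can start at zero
    return X, W, b
```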

Finally, assuming you have everything implemented correctly, you could try more layers to improve the accuracy, in order to introduce more capacity and non-linearity. The tutorial code above uses two layers: layer one is a 784$\rightarrow$785 fully connected layer with a sigmoid activation for non-linearity, and layer two is a 785$\rightarrow$10 softmax classifier. (The hidden-layer size and the activation function are design choices that can vary.)
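A rough sketch of the forward pass of such a two-layer network (the 785 hidden units and the sigmoid follow the tutorial's description; the shapes and names here are my own illustration):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """X: (N, 784); W1: (784, 785), b1: (785,); W2: (785, 10), b2: (10,)."""
    H = sigmoid(X @ W1 + b1)       # layer one: fully connected + sigmoid
    return softmax(H @ W2 + b2)    # layer two: softmax classification
```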

That being said, a single layer of softmax is still able to give you about 85% accuracy, and the training should not collapse. You can try the example code with the hidden layer removed.

Feel free to let me know if you have more questions, or send me your code for a check-up.