Solved – Results of softmax regression on the MNIST dataset

Tags: matlab, softmax

I'm currently following this lesson: http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/, and after coding up their equations I got:

Training accuracy: 71.2%
Test accuracy: 72.0%

Here is the relevant piece of code:

% function [f,g] = softmax_regression(theta, X,y)
function [f, g] = softmax_regression_vec(theta, X, y)
  %
  % Arguments:
  %   theta - A vector containing the parameter values to optimize.
  %       In minFunc, theta is reshaped to a long vector.  So we need to
  %       resize it to an n-by-(num_classes-1) matrix.
  %       Recall that we assume theta(:,num_classes) = 0.
  %
  %   X - The examples stored in a matrix.  
  %       X(i,j) is the i'th coordinate of the j'th example.
  %   y - The label for each example.  y(j) is the j'th example's label.
  %
  m=size(X,2);
  n=size(X,1);

  % theta is a vector;  need to reshape to n x (num_classes-1).
  theta=reshape(theta, n, []);
  num_classes=size(theta,2)+1;

  % initialize objective value and gradient.
  f = 0;
  g = zeros(size(theta));

  %
  % TODO:  Compute the softmax objective function and gradient using vectorized code.
  %        Store the objective function value in 'f', and the gradient in 'g'.
  %        Before returning g, make sure you form it back into a vector with g=g(:);
  %
  sequence = 1 : num_classes ;
  sequence = sequence';
  for i = 1 : m
      % Objective function
      expo = exp(theta' * X(:,i));
      totalSum = sum(expo) + 1;
      P = [expo / totalSum ; 1 / totalSum];
      f = f - log(P(y(i))) ;      

      % Gradient
      diff = (y(i) == sequence) - P;
      tmp = bsxfun(@times, X(:,i), diff' );      
      tmp = bsxfun(@plus, tmp, tmp(:,num_classes));       
      g = g - tmp(:,1 : num_classes - 1) ;
  end

  g=g(:); % make gradient a vector for minFunc

Is it unusual for softmax regression to achieve only about 70% on MNIST? Any help would be appreciated, thank you very much.

Best Answer

Here are some things you should check that might be holding you back:

  • Use a suitable loss function (i.e., the cross-entropy loss).

  • Use weight decay or regularization (tune the associated hyper-parameter using cross-validation).

  • Since you wrote the gradient computation by hand, don't forget to implement a gradient check to test for bugs in your code.

  • You might want to use an off-the-shelf optimizer rather than implementing gradient descent yourself. Such optimizers typically include improvements like momentum, learning-rate schedules, L-BFGS, or other techniques.

  • You might want to initialize the weights not to all zeros, but rather to small random numbers (e.g., chosen from a Gaussian), to break symmetry.

  • You might need to tune hyperparameters for your optimization routine (learning rate, momentum, etc.) via cross-validation.
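To make the first two points concrete, here is a minimal NumPy sketch (Python rather than the question's MATLAB) of a softmax cross-entropy loss with L2 weight decay. The function name and the decay constant `lam` are illustrative, not from the tutorial; unlike the tutorial's parameterization, this keeps a full n-by-num_classes theta, since weight decay already makes the optimum unique without pinning one column to zero.

```python
import numpy as np

def softmax_loss_grad(theta, X, y, lam=1e-4):
    """Cross-entropy loss and gradient for softmax regression with L2 weight decay.

    theta : (n, num_classes) weight matrix
    X     : (n, m) examples stored column-wise, as in the MATLAB code above
    y     : (m,) integer labels in 0..num_classes-1
    lam   : weight-decay strength (tune via cross-validation)
    """
    m = X.shape[1]
    scores = theta.T @ X                 # (num_classes, m)
    scores -= scores.max(axis=0)         # stabilize exp against overflow
    P = np.exp(scores)
    P /= P.sum(axis=0)                   # column j = class probabilities for example j
    f = -np.log(P[y, np.arange(m)]).sum() + 0.5 * lam * np.sum(theta ** 2)
    ind = np.zeros_like(P)
    ind[y, np.arange(m)] = 1.0           # one-hot labels
    g = -X @ (ind - P).T + lam * theta   # (n, num_classes)
    return f, g
```

This is fully vectorized (no per-example loop), which also matches what the TODO comment in the exercise asks for.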
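For the gradient-check point, a centered-difference checker can be sketched as follows (again NumPy; `grad_check`, the step size, and the tolerance are illustrative choices, not part of the tutorial):

```python
import numpy as np

def grad_check(fun, theta, eps=1e-6):
    """Compare fun's analytic gradient against centered finite differences.

    fun(theta) must return (loss, gradient); theta is a flat vector.
    Returns the maximum absolute difference over all coordinates.
    """
    _, g = fun(theta)
    num = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        f_plus, _ = fun(theta + e)
        f_minus, _ = fun(theta - e)
        num[i] = (f_plus - f_minus) / (2 * eps)
    return np.max(np.abs(g - num))

# Example: a quadratic f(t) = 0.5 * ||t||^2 has gradient t,
# so the reported difference should be near machine precision.
quad = lambda t: (0.5 * t @ t, t)
err = grad_check(quad, np.random.randn(10))
```

Running this on your softmax loss with a small random theta would immediately reveal any discrepancy between the analytic and numerical gradients, which is the most likely culprit for an unexpectedly low accuracy.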