I'm currently following the lesson at http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/ and, after implementing its equations, I got:
Training accuracy: 71.2%
Test accuracy: 72.0%
Here is the relevant piece of code:
% function [f,g] = softmax_regression(theta, X,y)
function [f, g] = softmax_regression_vec(theta, X, y)
%
% Arguments:
% theta - A vector containing the parameter values to optimize.
% In minFunc, theta is reshaped to a long vector. So we need to
% resize it to an n-by-(num_classes-1) matrix.
% Recall that we assume theta(:,num_classes) = 0.
%
% X - The examples stored in a matrix.
% X(i,j) is the i'th coordinate of the j'th example.
% y - The label for each example. y(j) is the j'th example's label.
%
m=size(X,2);
n=size(X,1);
% theta is a vector; need to reshape to n x (num_classes-1).
theta=reshape(theta, n, []);
num_classes=size(theta,2)+1;
% initialize objective value and gradient.
f = 0;
g = zeros(size(theta));
%
% TODO: Compute the softmax objective function and gradient using vectorized code.
% Store the objective function value in 'f', and the gradient in 'g'.
% Before returning g, make sure you form it back into a vector with g=g(:);
%
sequence = 1 : num_classes ;
sequence = sequence';
for i = 1 : m
% Objective function
expo = exp(theta' * X(:,i));
totalSum = sum(expo) + 1;
P = [expo / totalSum ; 1 / totalSum];
f = f - log(P(y(i))) ;
% Gradient
diff = (y(i) == sequence) - P;
tmp = bsxfun(@times, X(:,i), diff' );
tmp = bsxfun(@plus, tmp, tmp(:,num_classes));
g = g - tmp(:,1 : num_classes - 1) ;
end
%%% YOUR CODE HERE %%%
g=g(:); % make gradient a vector for minFunc
Is it unusual for softmax regression to reach only about 70% accuracy? Any help would be appreciated. Thank you very much.
Best Answer
Here are some things you should check that might be holding you back:
Use a suitable loss function (i.e., the cross-entropy loss).
Use weight decay or regularization (tune the associated hyper-parameter using cross-validation).
Since you wrote the gradient computation by hand, don't forget to implement a gradient check to test for bugs in your code.
You might want to use an off-the-shelf optimizer rather than implementing gradient descent yourself. These typically include improvements such as momentum, adaptive learning rates, or L-BFGS.
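You might want to initialize the weights to small random numbers (e.g., drawn from a Gaussian) rather than to all zeros. Breaking symmetry matters mainly for networks with hidden layers; the softmax regression objective is convex, so this point is less critical here.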
You might need to tune hyperparameters for your optimization routine (learning rate, momentum, etc.) via cross-validation.
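The gradient check mentioned above can be sketched roughly as follows. This is only an illustration, not the tutorial's own checker: the toy data, the sizes, and the names scores, logprob, objective, and rel_err are assumptions made up for this example. It uses the same parameterization as the question (the last class's parameters are fixed at zero) and compares a vectorized analytic gradient against central finite differences.

```matlab
% Toy problem (hypothetical sizes and deterministic data, for illustration only).
n = 4; m = 12; K = 3;                            % features, examples, classes
X = sin(reshape(1:n*m, n, m));                   % X(:,j) is the j'th example
y = mod(0:m-1, K) + 1;                           % labels in 1..K
theta = 0.1 * cos(reshape(1:n*(K-1), n, K-1));   % n-by-(K-1) parameters

% One-hot label matrix, K-by-m: Y(k,j) = 1 when y(j) == k.
Y = double(bsxfun(@eq, (1:K)', y));

% Class scores with the last class fixed at zero, then log-softmax per column.
scores    = @(t) [reshape(t, n, K-1)' * X; zeros(1, m)];
logprob   = @(S) bsxfun(@minus, S, log(sum(exp(S), 1)));
objective = @(t) -sum(sum(Y .* logprob(scores(t))));   % cross-entropy loss

% Analytic gradient: -X * (Y - P)', keeping only the first K-1 columns.
t0 = theta(:);
P  = exp(logprob(scores(t0)));                   % K-by-m class probabilities
G  = -X * (Y - P)';                              % n-by-K
g  = G(:, 1:K-1);
g  = g(:);

% Numerical gradient by central differences, one coordinate at a time.
epsilon = 1e-5;
g_num = zeros(size(t0));
for k = 1:numel(t0)
    e = zeros(size(t0));
    e(k) = epsilon;
    g_num(k) = (objective(t0 + e) - objective(t0 - e)) / (2 * epsilon);
end

% A small relative error suggests the analytic gradient matches the objective.
rel_err = norm(g - g_num) / norm(g + g_num);
fprintf('relative error: %g\n', rel_err);
```

To test the function from the question instead, the analytic part could be replaced by a call such as [f, g] = softmax_regression_vec(t0, X, y), with objective wrapping the same function's first output. A relative error far above roughly 1e-6 usually points to a bug in the analytic gradient.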