MNIST Digit Recognition – Best Results with Fully Connected Neural Networks

Tags: backpropagation, deep learning, image processing, machine learning, neural networks

To understand fully how a neural network works internally, I'm rewriting one from scratch in Python + NumPy only. (As it's for learning purposes, performance is not an issue.)

Before moving on to convolutional networks (CNNs) or other more complex tools, I'd like to determine the maximum accuracy we can hope for with only a standard NN (a few fully connected hidden layers + activation function) on the MNIST digit database.

I get a max of ~96.2% accuracy with:

  • network structure: [784, 200, 80, 10]
  • learning_rate: 0.01
  • epochs: 3
  • no biases used
  • activation function: sigmoid (1/(1+exp(-x)))
  • weight initialization: truncated normal distribution on [-1, 1]
  • optimization process: pure stochastic gradient descent
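
For reference, this structure contains 784×200 + 200×80 + 80×10 = 156,800 + 16,000 + 800 = 173,600 weights in total (no bias terms).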

I've read in the past that it's possible to get 98% even with a standard NN.

Question: what parameters (as shown above) would you use to get more than 98% accuracy on the MNIST digit database with a standard NN? See full code below.


What I've tried so far:

  • replaced the weights with a normal distribution multiplied by various factors ("He et al." or "Xavier" initialization); see also What are good initial weights in a neural network?:

    wm = np.random.randn(nodes_out, nodes_in + bias_node) * np.sqrt(2/nodes_in)  # also tried with np.sqrt(1/nodes_in)
    

    but it did not change anything significantly; I noticed it was even worse in this case

  • replaced the sigmoid with ReLU:

    def activation_function(x): 
        return np.maximum(0, x)
    

    For an unknown reason, the accuracy dropped to 10% (i.e. the NN is useless!) with this activation_function (see the sketch just below for a likely cause).
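
    One likely cause (my own note, not part of the original post): the backpropagation step in the code below hard-codes the sigmoid derivative as out_vector * (1.0 - out_vector), so replacing only the forward activation with ReLU makes the computed gradients wrong. A minimal sketch of what a consistent swap would need:

    def relu(x):
        return np.maximum(0, x)  # forward pass

    def relu_derivative(out):
        # d/dx max(0, x) is 1 where x > 0, else 0; since out = relu(x) > 0
        # exactly where x > 0, it can be computed from the layer output
        return (out > 0).astype(float)

    The weight-update line would then use output_errors * relu_derivative(out_vector) instead of the sigmoid expression. (ReLU also typically calls for smaller initial weights, e.g. He initialization, to keep the activations from exploding.)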


Self-contained code (~100 lines) that you can run directly (largely based on https://www.python-course.eu/neural_network_mnist.php, but somewhat rewritten); you only need to download mnist_train.csv and mnist_test.csv first:

import matplotlib.pyplot as plt
import numpy as np
from scipy.special import expit as activation_function  # sigmoid: 1/(1+exp(-x))
from scipy.stats import truncnorm

if True:  # recreate the MNIST .npy arrays; run this once, then set to False
    train_data = np.loadtxt("mnist_train.csv", delimiter=",")
    test_data = np.loadtxt("mnist_test.csv", delimiter=",")
    train_imgs = np.asarray(train_data[:, 1:], dtype=float) / 255.0
    test_imgs = np.asarray(test_data[:, 1:], dtype=float) / 255.0
    train_labels = np.asarray(train_data[:, :1], dtype=float)
    test_labels = np.asarray(test_data[:, :1], dtype=float)
    lr = np.arange(10)
    train_labels_one_hot = (lr == train_labels).astype(float)  # one-hot encoding via broadcasting
    test_labels_one_hot = (lr == test_labels).astype(float)
    for i, d in enumerate([train_imgs, test_imgs, train_labels, test_labels, train_labels_one_hot, test_labels_one_hot]):
        np.save('%i.array' % i, d)

(train_imgs, test_imgs, train_labels, test_labels, train_labels_one_hot, test_labels_one_hot) = [np.load('%i.array.npy' % i) for i in range(6)]

print('Data loaded.')

if False:  # show images
    for i in range(10):
        img = train_imgs[i].reshape((28,28))
        plt.imshow(img, cmap="Greys")
        plt.show()

class NeuralNetwork:
    def __init__(self, network_structure, learning_rate, bias=None):  
        self.structure = network_structure
        self.no_of_layers = len(self.structure)
        self.learning_rate = learning_rate 
        self.bias = bias
        self.create_weight_matrices()

    def create_weight_matrices(self):
        bias_node = 1 if self.bias else 0
        self.weights_matrices = []
        for k in range(self.no_of_layers-1):
            nodes_in = self.structure[k]
            nodes_out = self.structure[k+1]
            n = (nodes_in + bias_node) * nodes_out
            X = truncnorm(-1, 1,  loc=0, scale=1)
            #X = truncnorm(-1 / np.sqrt(nodes_in), 1 / np.sqrt(nodes_in),  loc=0, scale=1)  # accuracy is worse
            wm = X.rvs(n).reshape((nodes_out, nodes_in + bias_node))
            self.weights_matrices.append(wm)

    def train(self, input_vector, target_vector): 
        # forward pass: store each layer's output for backpropagation
        input_vector = np.array(input_vector, ndmin=2).T
        res_vectors = [input_vector]
        for k in range(self.no_of_layers-1):
            in_vector = res_vectors[-1]
            if self.bias:
                in_vector = np.concatenate((in_vector, [[self.bias]]))
                res_vectors[-1] = in_vector
            x = np.dot(self.weights_matrices[k], in_vector)
            out_vector = activation_function(x)
            res_vectors.append(out_vector)    

        # backward pass: propagate the output error back and update each weight matrix
        target_vector = np.array(target_vector, ndmin=2).T
        output_errors = target_vector - out_vector
        for k in range(self.no_of_layers-1, 0, -1):
            out_vector = res_vectors[k]
            in_vector = res_vectors[k-1]
            if self.bias and k != (self.no_of_layers-1):
                out_vector = out_vector[:-1,:].copy()
            tmp = output_errors * out_vector * (1.0 - out_vector)  # sigma'(x) = sigma(x) (1 - sigma(x))
            tmp = np.dot(tmp, in_vector.T)
            self.weights_matrices[k-1] += self.learning_rate * tmp
            output_errors = np.dot(self.weights_matrices[k-1].T, output_errors)
            if self.bias:
                output_errors = output_errors[:-1,:]

    def run(self, input_vector):
        if self.bias:
            input_vector = np.concatenate((input_vector, [self.bias]))
        in_vector = np.array(input_vector, ndmin=2).T
        for k in range(self.no_of_layers-1):
            x = np.dot(self.weights_matrices[k], in_vector)
            out_vector = activation_function(x)
            in_vector = out_vector
            if self.bias:
                in_vector = np.concatenate((in_vector, [[self.bias]]))
        return out_vector

    def evaluate(self, data, labels):
        corrects, wrongs = 0, 0
        for i in range(len(data)):
            res = self.run(data[i])
            res_max = res.argmax()
            if res_max == labels[i]:
                corrects += 1
            else:
                wrongs += 1
        return corrects, wrongs

ANN = NeuralNetwork(network_structure=[784, 200, 80, 10], learning_rate=0.01, bias=None)

for epoch in range(3):
    for i in range(len(train_imgs)):
        if i % 1000 == 0:
            print('epoch:', epoch, 'img number:', i, '/', len(train_imgs))
        ANN.train(train_imgs[i], train_labels_one_hot[i])

corrects, wrongs = ANN.evaluate(test_imgs, test_labels)
print("accruracy: test", corrects / (corrects + wrongs))

Edit: With 10 epochs, structure [784, 400, 400, 10] and the other parameters identical, I finally got 97.8% accuracy! Is this a case of overfitting (as mentioned in a comment)?

Another test: 20 epochs, structure [784, 700, 500, 10], other parameters identical: 97.9% accuracy.

Best Answer

Yann LeCun has compiled a big list of results (and the associated papers) on MNIST, which may be of interest.

The best non-convolutional neural net result is by Cireşan, Meier, Gambardella and Schmidhuber (2010) (arXiv), who reported an accuracy of 99.65%. As their abstract describes, their approach was essentially brute force:

Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning.

The network itself was a six-layer MLP with 2500, 2000, 1500, 1000, 500, and 10 neurons per layer, and the training set was augmented with affine and elastic deformations. The only other secret ingredient was a lot of compute; the last few pages describe how they parallelized it.
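
To give a concrete flavor of that augmentation, here is a minimal sketch of an elastic deformation in NumPy/SciPy (my own illustration in the spirit of Simard-style distortions; the parameters alpha and sigma are arbitrary choices, not values from the paper):

    import numpy as np
    from scipy.ndimage import gaussian_filter, map_coordinates

    def elastic_deform(image, alpha=8.0, sigma=3.0, rng=None):
        # image: 2-D array, e.g. a 28x28 MNIST digit
        rng = np.random.default_rng() if rng is None else rng
        # random displacement fields, smoothed by a Gaussian and scaled by alpha
        dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
        dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
        y, x = np.meshgrid(np.arange(image.shape[0]),
                           np.arange(image.shape[1]), indexing='ij')
        # resample the image at the displaced coordinates (bilinear interpolation)
        return map_coordinates(image, [y + dy, x + dx], order=1, mode='reflect')

Each training image is then replaced (or supplemented) by a freshly deformed copy every epoch, effectively enlarging the training set.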

A year later, the same group (Meier et al., 2011) reported similar results using an ensemble of 25 one-layer neural networks (0.39% test error*). These were individually smaller (800 hidden units each), but the training strategy is a bit fancier. Similar strategies with convnets do a little better (~0.23% test error*). Since MLPs are universal approximators, I can't see why a suitable MLP wouldn't be able to match that, though it might be very large and difficult to train.
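
The basic committee idea is easy to express with the NeuralNetwork class from the question (a hypothetical sketch; Meier et al.'s actual training and averaging scheme is more elaborate):

    import numpy as np

    def ensemble_predict(nets, image):
        # each net.run() returns a (10, 1) vector of output activations;
        # average the outputs of independently trained nets and pick the best digit
        outputs = [net.run(image) for net in nets]
        return np.mean(outputs, axis=0).argmax()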


* Annoyingly, very few of these papers report confidence intervals, standard errors, or anything like that, which makes it difficult to compare these results directly.