Solved – Building an autoencoder in Tensorflow to surpass PCA

Tags: autoencoders, deep-learning, pca, python, tensorflow

Hinton and Salakhutdinov, in Reducing the Dimensionality of Data with Neural Networks (Science, 2006), proposed a non-linear PCA through the use of a deep autoencoder. I have tried to build and train such an autoencoder in Tensorflow several times, but I have never been able to obtain better results than linear PCA.

How can I efficiently train an autoencoder?

(Later edit by @amoeba: the original version of this question contained Python Tensorflow code that did not work correctly. One can find it in the edit history.)

Best Answer

Here is the key figure from the 2006 Science paper by Hinton and Salakhutdinov:

It shows dimensionality reduction of the MNIST dataset ($28\times 28$ grayscale images of single digits) from the original 784 dimensions to two.

Let's try to reproduce it. I will not be using Tensorflow directly, because it's much easier to use Keras (a higher-level library running on top of Tensorflow) for simple deep learning tasks like this. H&S used a $$784\to 1000\to 500\to 250\to 2\to 250\to 500\to 1000\to 784$$ architecture with logistic units, pre-trained with a stack of Restricted Boltzmann Machines. Ten years later, this sounds very old-school. I will use a simpler $$784\to 512\to 128\to 2\to 128\to 512\to 784$$ architecture with exponential linear units and without any pre-training. I will use the Adam optimizer (a particular implementation of adaptive stochastic gradient descent with momentum).


The code is copy-pasted from a Jupyter notebook. It uses Python 3.6; you need to install matplotlib (for pylab), NumPy, seaborn, TensorFlow, and Keras. When running in a Python shell, you may need to add plt.show() to display the plots.

Initialization

%matplotlib notebook

import pylab as plt
import numpy as np
import seaborn as sns; sns.set()

import keras
from keras.datasets import mnist
from keras.models import Sequential, Model
from keras.layers import Dense
from keras.optimizers import Adam

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784) / 255  # flatten 28x28 images and scale to [0, 1]
x_test = x_test.reshape(10000, 784) / 255

PCA

mu = x_train.mean(axis=0)
U,s,V = np.linalg.svd(x_train - mu, full_matrices=False)
Zpca = np.dot(x_train - mu, V.transpose())   # scores on all principal components

Rpca = np.dot(Zpca[:,:2], V[:2,:]) + mu      # reconstruction from the first 2 PCs
err = np.sum((x_train-Rpca)**2)/Rpca.shape[0]/Rpca.shape[1]   # mean squared error per pixel
print('PCA reconstruction error with 2 PCs: ' + str(round(err,3)))

This outputs:

PCA reconstruction error with 2 PCs: 0.056
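
As a sanity check, one can get essentially the same number with scikit-learn's PCA. This is a small optional addition, not part of the original code; it assumes scikit-learn is installed (it is not in the package list above), and the names pca_sk, Zpca_sk, Rpca_sk, and err_sk are just illustrative.

from sklearn.decomposition import PCA

pca_sk = PCA(n_components=2)
Zpca_sk = pca_sk.fit_transform(x_train)        # 2D scores
Rpca_sk = pca_sk.inverse_transform(Zpca_sk)    # reconstruction from 2 PCs
err_sk = np.mean((x_train - Rpca_sk)**2)       # same per-pixel MSE as above
print('sklearn PCA reconstruction error with 2 PCs: ' + str(round(err_sk, 3)))

It should print approximately the same 0.056.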

Training the autoencoder

m = Sequential()
m.add(Dense(512,  activation='elu', input_shape=(784,)))
m.add(Dense(128,  activation='elu'))
m.add(Dense(2,    activation='linear', name="bottleneck"))
m.add(Dense(128,  activation='elu'))
m.add(Dense(512,  activation='elu'))
m.add(Dense(784,  activation='sigmoid'))
m.compile(loss='mean_squared_error', optimizer = Adam())
history = m.fit(x_train, x_train, batch_size=128, epochs=5, verbose=1, 
                validation_data=(x_test, x_test))

encoder = Model(m.input, m.get_layer('bottleneck').output)
Zenc = encoder.predict(x_train)  # bottleneck representation
Renc = m.predict(x_train)        # reconstruction

This takes ~35 sec on my work desktop and outputs:

Train on 60000 samples, validate on 10000 samples
Epoch 1/5
60000/60000 [==============================] - 7s - loss: 0.0577 - val_loss: 0.0482
Epoch 2/5
60000/60000 [==============================] - 7s - loss: 0.0464 - val_loss: 0.0448
Epoch 3/5
60000/60000 [==============================] - 7s - loss: 0.0438 - val_loss: 0.0430
Epoch 4/5
60000/60000 [==============================] - 7s - loss: 0.0423 - val_loss: 0.0416
Epoch 5/5
60000/60000 [==============================] - 7s - loss: 0.0412 - val_loss: 0.0407

So you can already see that we surpassed the PCA loss after only two training epochs.
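
To compare with the PCA number on exactly the same footing, one can also compute the per-pixel reconstruction error of the trained autoencoder on the training set (a small addition to the original code; err_ae is just an illustrative name):

err_ae = np.sum((x_train - Renc)**2) / Renc.shape[0] / Renc.shape[1]
print('Autoencoder reconstruction error: ' + str(round(err_ae, 3)))

It should come out close to the final training loss above, i.e. clearly below the PCA value of 0.056.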

(By the way, it is instructive to change all activation functions to activation='linear' and to observe how the loss converges precisely to the PCA loss. That is because a linear autoencoder is equivalent to PCA.)
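
If you want to try that, here is a minimal sketch of the linear variant (same layer sizes as above, all activations set to linear; the name m_lin and the epoch count are arbitrary choices):

m_lin = Sequential()
m_lin.add(Dense(512, activation='linear', input_shape=(784,)))
m_lin.add(Dense(128, activation='linear'))
m_lin.add(Dense(2,   activation='linear', name="bottleneck"))
m_lin.add(Dense(128, activation='linear'))
m_lin.add(Dense(512, activation='linear'))
m_lin.add(Dense(784, activation='linear'))
m_lin.compile(loss='mean_squared_error', optimizer=Adam())
m_lin.fit(x_train, x_train, batch_size=128, epochs=15, verbose=1,
          validation_data=(x_test, x_test))   # with enough training, the loss approaches ~0.056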

Plotting PCA projection side-by-side with the bottleneck representation

plt.figure(figsize=(8,4))
plt.subplot(121)
plt.title('PCA')
plt.scatter(Zpca[:5000,0], Zpca[:5000,1], c=y_train[:5000], s=8, cmap='tab10')
plt.gca().get_xaxis().set_ticklabels([])
plt.gca().get_yaxis().set_ticklabels([])

plt.subplot(122)
plt.title('Autoencoder')
plt.scatter(Zenc[:5000,0], Zenc[:5000,1], c=y_train[:5000], s=8, cmap='tab10')
plt.gca().get_xaxis().set_ticklabels([])
plt.gca().get_yaxis().set_ticklabels([])

plt.tight_layout()

[Figure: 2D embeddings of the first 5000 training digits, colored by digit label; PCA on the left, autoencoder bottleneck on the right]

Reconstructions

And now let's look at the reconstructions (first row - original images, second row - PCA, third row - autoencoder):

plt.figure(figsize=(9,3))
toPlot = (x_train, Rpca, Renc)
for i in range(10):
    for j in range(3):
        ax = plt.subplot(3, 10, 10*j+i+1)
        plt.imshow(toPlot[j][i,:].reshape(28,28), interpolation="nearest", 
                   vmin=0, vmax=1)
        plt.gray()
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)

plt.tight_layout()

[Figure: ten example digits; original images (top row), PCA reconstructions (middle row), autoencoder reconstructions (bottom row)]

One can obtain much better results with a deeper network, some regularization, and longer training. Experiment. Deep learning is easy!
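
For example, here is one hypothetical starting point for such experiments (a deeper encoder/decoder with a bit of dropout and more epochs; all layer sizes, the dropout rate, and the epoch count are arbitrary, untuned choices, not values from the original answer):

from keras.layers import Dropout

m2 = Sequential()
m2.add(Dense(1024, activation='elu', input_shape=(784,)))
m2.add(Dropout(0.1))                 # mild dropout for regularization; the rate 0.1 is an arbitrary choice
m2.add(Dense(512,  activation='elu'))
m2.add(Dense(256,  activation='elu'))
m2.add(Dense(2,    activation='linear', name="bottleneck"))
m2.add(Dense(256,  activation='elu'))
m2.add(Dense(512,  activation='elu'))
m2.add(Dense(1024, activation='elu'))
m2.add(Dense(784,  activation='sigmoid'))
m2.compile(loss='mean_squared_error', optimizer=Adam())
m2.fit(x_train, x_train, batch_size=128, epochs=30, verbose=1,
       validation_data=(x_test, x_test))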