Solved – How does a Stacked AutoEncoder increase the performance of a Convolutional Neural Network in image classification tasks

autoencoders, computer vision, conv-neural-network, deep learning, neural networks

A Stacked Auto-Encoder provides a version of the raw data with more promising feature information, which can be used to train a classifier in a specific context and achieve better accuracy than training on the raw data. For image classification tasks, how can a stacked auto-encoder help a traditional Convolutional Neural Network? I recently read the paper http://people.idsia.ch/~ciresan/data/icann2011.pdf but can't understand it clearly.

I also have a small question about CNNs: can any pre-training step before the first convolution operation, such as dimensionality reduction or an auto-encoder's output, be used as the input image instead of the real image data, and how much does it affect the performance of a Convolutional Neural Network in image classification tasks?

Best Answer

For image classification tasks, how can a stacked auto-encoder help a traditional Convolutional Neural Network?

As mentioned in the paper, we can use the pre-trained weights to initialize the CNN layers. Although this doesn't add any representational capacity to the CNN, it normally provides a good starting point for training (especially when there is an insufficient amount of labeled data).
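A minimal sketch of what "initialize with pre-trained weights" means in practice, assuming a hypothetical autoencoder that was trained on 5x5 image patches (the shapes and the random stand-in weights here are illustrative, not from the paper):

```python
import numpy as np

# Stand-in for the encoder weights of an autoencoder trained on 5x5
# patches: one row per learned feature, 25 = 5*5 pixels per patch.
rng = np.random.default_rng(0)
n_filters, patch = 16, 5
W_pretrained = rng.normal(size=(n_filters, patch * patch))

# Initializing the CNN's first layer is just reshaping each learned
# patch feature into a 2D convolution kernel.
conv_kernels = W_pretrained.reshape(n_filters, patch, patch)

# The CNN then starts training from these kernels instead of random
# ones; backprop fine-tunes them with the rest of the network.
print(conv_kernels.shape)  # (16, 5, 5)
```

The key point is that nothing about the CNN architecture changes; only the starting point of optimization does.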

can any pre-training step before the first convolution operation, such as dimensionality reduction or an auto-encoder's output, be used as the input image instead of the real image data

Because of a CNN's local connectivity, if the topology of the data is lost after dimensionality reduction, then a CNN is no longer appropriate.

For example, suppose our data are images. If we treat each pixel as a dimension and use PCA for dimensionality reduction, then the new representation of an image will be a vector that no longer preserves the original 2D topology (and the correlation between adjacent pixels). In this case it cannot be used directly with 2D CNNs (although there are ways to recover the topology).

Using the auto-encoder output should work well with CNNs, as it can be seen as adding an extra layer (with fixed parameters) between the input and the CNN.
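Unlike PCA, an auto-encoder's reconstruction has the same 2D shape as the input, which is why it can stand in for the raw image. A minimal sketch with a hypothetical tied-weight autoencoder (random stand-in weights; a trained model would use its learned weights):

```python
import numpy as np

# Stand-in for pretrained tied weights: 64 hidden units, 28*28 inputs.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(64, 784))
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def autoencode(img):                  # img: (28, 28)
    h = sigmoid(W @ img.reshape(-1))  # encode to a 64-dim code
    recon = sigmoid(W.T @ h)          # decode with tied weights
    return recon.reshape(28, 28)      # same 2D shape as the input

x = rng.random((28, 28))
x_recon = autoencode(x)               # feed this to the CNN as the "image"
print(x_recon.shape)  # (28, 28)
```

Because the reconstruction keeps the pixel grid, the CNN's local connectivity still applies; the autoencoder simply acts as a fixed preprocessing layer.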

how much does it affect the performance of a Convolutional Neural Network in image classification tasks

I happened to do a related project in college, where I tried to label each part of an image as road, sky, or other. Although the results are far from satisfactory, they might give some idea of how these pre-processing techniques affect performance.

[Figures: (1) image of a clear road; (2) outcome of a simple two-layer CNN; (3) outcome of the CNN with its first layer initialized by a pre-trained CAE; (4) outcome of the CNN with ZCA whitening]

The CNNs are trained using SGD with fixed learning rates. Tested on the KITTI road category data set, the error rate of method (2) is around 14%, while the error rates of methods (3) and (4) are around 12%.

Please correct me where I'm wrong. :)