For image classification tasks, how can a stacked autoencoder help a traditional Convolutional Neural Network?
As mentioned in the paper, we can use the pre-trained weights to initialize the CNN layers. Although that essentially doesn't add anything to the CNN, it usually provides a good starting point for training (especially when there is an insufficient amount of labeled data).
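A minimal sketch of that idea, assuming Keras (the layer shapes and names are illustrative, and the autoencoder would of course be trained before the copy):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in convolutional autoencoder (normally trained unsupervised first):
cae = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu", padding="same", name="encoder"),
    layers.Conv2DTranspose(3, 3, activation="sigmoid", padding="same"),
])
# cae.compile(optimizer="adam", loss="mse")
# cae.fit(x_unlabeled, x_unlabeled, epochs=...)   # x_unlabeled: image array

# A CNN whose first layer has the same shape as the pre-trained encoder:
cnn = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu", padding="same", name="conv1"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# Copy the pre-trained weights in; the layer stays trainable, so this only
# sets a good starting point for the supervised training that follows.
cnn.get_layer("conv1").set_weights(cae.get_layer("encoder").get_weights())
```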
Any pre-training step before the first convolution operation, like dimensionality reduction, or the AutoEncoder output, can be used as the input image instead of the real image data in a CNN.
Because of a CNN's local connectivity, if the topology of the data is lost after dimensionality reduction, then CNNs are no longer appropriate.
For example, suppose our data are images. If we treat each pixel as a dimension and use PCA for dimensionality reduction, then the new representation of an image will be a vector that no longer preserves the original 2D topology (or the correlation between adjacent pixels). So in this case it cannot be used directly with 2D CNNs (although there are ways to recover the topology).
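A quick illustration of the PCA point with scikit-learn (shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))        # 1000 grayscale 28x28 images
flat = images.reshape(1000, -1)            # (1000, 784): one dimension per pixel

reduced = PCA(n_components=100).fit_transform(flat)
print(reduced.shape)                       # (1000, 100)
# Each image is now a plain vector of principal-component scores; adjacent
# entries are unrelated components, not neighboring pixels, so there is no
# 2D grid left for a convolution kernel to slide over.
```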
Using the AutoEncoder output should work well with CNNs, as it can be seen as adding an additional layer (with fixed parameters) between the CNN and the input.
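Treating the AutoEncoder output that way might look as follows (again a Keras sketch with illustrative shapes; note the encoder here is convolutional, so its output keeps the 2D layout):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Encoder half of a convolutional autoencoder, assumed already trained;
# its output keeps a 2D layout, so a CNN can still consume it.
encoder = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),                 # output: (16, 16, 8) feature maps
])
encoder.trainable = False                  # fixed parameters: pure preprocessing

inputs = keras.Input(shape=(32, 32, 3))
x = encoder(inputs)                        # AE output replaces the raw image
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```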
How much does it affect the performance of a Convolutional Neural Network in the context of image classification tasks?
I happened to have done a related project at college, where I tried to label each part of an image as road, sky, or other. Although the results were far from satisfactory, they might give some idea of how those pre-processing techniques affect performance.
[Figures: (1) an image of a clear road; (2) output of a simple two-layer CNN; (3) CNN with its first layer initialized by a pre-trained CAE; (4) CNN with ZCA whitening.]
The CNNs are trained using SGD with fixed learning rates. Tested on the KITTI road category dataset, the error rate of method (2) is around 14%, and the error rates of methods (3) and (4) are around 12%.
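For reference, the ZCA whitening of method (4) goes roughly like this in numpy (the epsilon is a typical smoothing constant, not necessarily the exact value I used):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten X of shape (n_samples, n_features), e.g. flattened images."""
    X = X - X.mean(axis=0)                         # center each feature
    cov = np.cov(X, rowvar=False)                  # pixel covariance matrix
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T  # ZCA whitening matrix
    return X @ W                                   # decorrelated, ~unit variance
```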
Please correct me where I'm wrong. :)
Hyperparameter choice is something that can't really be answered definitively: sure, there are procedures that can be followed, but it's largely a case of trial and error.
A single DA (denoising autoencoder) can indeed extract meaningful features; however, when the encoding dimension L is larger than the input dimension D (i.e., overcomplete learning), most of the features will end up being random noise.
The reason your autoencoder is not learning meaningful features is that, given the degree of freedom the autoencoder has in the encoding layer (i.e., L > D), it becomes quite easy for it to learn an identity mapping of the input.
So to alleviate this problem, you have to impose additional constraints that limit this degree of freedom.
I believe you can try the following and see what the outcome is:
The first and probably easiest step would be to reduce the number of encoding-layer nodes from 1000 to something a little closer to the input dimension, i.e. 784. I would say 800 would be a good start. Then visualize the features and see whether some of them have improved.
Apply additional regularization constraints, say L2 regularization on the weights (and if you are already doing that, increase the corresponding penalty term), or other such penalization techniques.
Tied weights. Use tied weights on the encoding and decoding layers if you are not doing so already, i.e. W_decoding = W_encoding.T (see the sketch after this list).
When not using tied weights, one of the two layers often learns larger, better weights (for lack of a better word) and compensates for the poor weights learned by the other. Placing this constraint forces the autoencoder to learn a balanced set of weights. It also often improves training time and puts a pretty good limit on the degree of freedom (the number of free, trainable parameters is halved!).
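Here is the sketch mentioned above, combining the three suggestions in Keras (784-dimensional inputs, 800 encoding units, an L2 penalty, and tied weights via a custom decoder layer; the layer class and the penalty value are my own illustration, not a built-in):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers

class TiedDecoder(layers.Layer):
    """Decoder that reuses the encoder's kernel: W_decoding = W_encoding.T."""
    def __init__(self, encoder_layer, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.encoder_layer = encoder_layer
        self.output_dim = output_dim

    def build(self, input_shape):
        # Only the bias is a new trainable parameter; the kernel is shared,
        # which halves the number of free parameters.
        self.bias = self.add_weight(shape=(self.output_dim,),
                                    initializer="zeros", name="bias")

    def call(self, x):
        return tf.nn.sigmoid(
            tf.matmul(x, tf.transpose(self.encoder_layer.kernel)) + self.bias)

inputs = keras.Input(shape=(784,))
encoder = layers.Dense(800, activation="sigmoid",          # 1000 -> 800 units
                       kernel_regularizer=regularizers.l2(1e-4))
encoded = encoder(inputs)                                   # builds the encoder
decoded = TiedDecoder(encoder, 784)(encoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```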
Give this a try. Might help.
Best Answer
I don't know why this was downvoted, but I figured out the answer, though it may be obvious.
First, the training set is used to train a single compression/encoder layer: the autoencoder learns to reconstruct (approximate) its own inputs.
Once this is done, the weights of the encoding layer are saved and paired with a classification layer (e.g. a softmax layer) to learn a supervised classifier. This uses the same training set as before, but now fits against the labels/classes of that training set, which weren't used previously.
After the classifier is trained, it can be used to make predictions or check performance using the test set.
For example, if you already had an autoencoder trained and wanted to use the encoding layer with a softmax layer, you could do the following with Keras:
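Something along these lines (a sketch with illustrative sizes; I assume the trained autoencoder's encoding layer is named "encoder"):

```python
from tensorflow import keras
from tensorflow.keras import layers

# MNIST-style data; `autoencoder` is assumed to be already trained
# unsupervised on x_train, with its encoding layer named "encoder".
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

encoder_layer = autoencoder.get_layer("encoder")
encoder_layer.trainable = False        # optionally freeze the pre-trained weights

inputs = keras.Input(shape=(784,))
features = encoder_layer(inputs)       # reuse the trained encoding layer
outputs = layers.Dense(10, activation="softmax")(features)

classifier = keras.Model(inputs, outputs)
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# Same training set as before, now fitted with its labels:
classifier.fit(x_train, y_train, epochs=10, validation_split=0.1)
classifier.evaluate(x_test, y_test)    # check performance on the test set
```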
In the stacked autoencoder case, the procedure is the same, just with more encoding layers. There are discussions about this using Keras here and here.
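Sketching the stacked case under the same assumptions, continuing from the snippet above (enc1 and enc2 are hypothetical pre-trained encoder layers):

```python
# enc1 (784 -> 256) and enc2 (256 -> 64) would be trained greedily:
# enc1 inside an autoencoder on the raw inputs, enc2 on enc1's codes.
inputs = keras.Input(shape=(784,))
x = enc1(inputs)                       # first pre-trained encoding layer
x = enc2(x)                            # second pre-trained encoding layer
outputs = layers.Dense(10, activation="softmax")(x)

stacked_classifier = keras.Model(inputs, outputs)
stacked_classifier.compile(optimizer="adam",
                           loss="sparse_categorical_crossentropy",
                           metrics=["accuracy"])
```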