If I understand you correctly, the question is how to train the net when it has pooling layers. A max pooling layer has no trainable weights, but during backprop you can treat it much like a "normal" layer with fixed weights. Imagine you have a max pooling layer with grid size 3x3. Imagine further that, for a given training example, pixel number 5 (that is, the one in position (2,2)) had the max value in forward propagation, i.e. its value was passed through the max pooling layer. When doing backprop for that sample, the weight between pixel number 5 and the output of the pooling is simply one, while for the other eight pixels it is zero. And since max pooling does no further transformation, the error passed back to that pixel is just the error from the layer that came after the max pooling layer. For a more mathematical formulation, there is a nice website: http://andrew.gibiansky.com/blog/machine-learning/convolutional-neural-networks/
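The routing described above can be sketched in a few lines of plain Python. This is only a minimal illustration for a single 3x3 window, not a full layer. The function names, the window values, and the upstream gradient of 2.0 are all made up for the example.

```python
def max_pool_forward(window):
    """Return (max value, flat index of the max) for one pooling window."""
    flat = [v for row in window for v in row]
    # On ties, the first maximum wins (an assumption of this sketch).
    argmax = max(range(len(flat)), key=lambda i: flat[i])
    return flat[argmax], argmax


def max_pool_backward(upstream_grad, argmax, rows, cols):
    """Send the upstream error to the pixel that won the max; zero elsewhere."""
    grad = [[0.0] * cols for _ in range(rows)]
    grad[argmax // cols][argmax % cols] = upstream_grad
    return grad


# Hypothetical window: the centre pixel (number 5, position (2,2)) is the max.
window = [
    [0.1, 0.3, 0.2],
    [0.4, 0.9, 0.5],
    [0.0, 0.6, 0.7],
]

value, argmax = max_pool_forward(window)
# 2.0 stands in for the error coming from the layer after the pooling.
grad_in = max_pool_backward(upstream_grad=2.0, argmax=argmax, rows=3, cols=3)

for row in grad_in:
    print(row)
```

Only the centre entry of `grad_in` is non-zero, which is the "weight one for the winner, zero for the other eight" behaviour. A real layer would store the argmax of every window during the forward pass and repeat this for each window.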
I'll first try to share some intuition behind CNNs and then comment on the particular topics you listed.
The convolution and sub-sampling layers in a CNN are no different from the hidden layers in a common MLP, i.e. their function is to extract features from their input. These features are then passed to the next hidden layer to extract still more complex features, or passed directly to a standard classifier that outputs the final prediction (usually a softmax, but an SVM or any other classifier can also be used). In the context of image recognition, these features are image traits, such as stroke patterns in the lower layers and object parts in the upper layers.
In natural images these features tend to be the same at all locations. Recognizing a certain stroke pattern in the middle of the image is as useful as recognizing it close to the borders. So why don't we replicate the hidden layer and connect multiple copies of it to all regions of the input image, so that the same features can be detected anywhere? That is exactly what a CNN does, but in an efficient way. After the replication (the "convolution" step) we add a sub-sampling step, which can be implemented in many ways but is nothing more than a sub-sample. In theory this step could even be removed, but in practice it is essential to keep the problem tractable.
Thus:
- Correct.
- As explained above, the hidden layers of a CNN are feature extractors, as in a regular MLP. The alternating convolution and sub-sampling steps are performed during both training and classification, so they are not something done "before" the actual processing. I wouldn't call them "pre-processing", in the same way that the hidden layers of an MLP are not called that.
- Correct.
A good image for understanding the convolution is on the CNN page of the UFLDL tutorial. Think of a hidden layer with a single neuron which is trained to extract features from $3 \times 3$ patches. If we convolve this single learned feature over a $5 \times 5$ image, the process can be represented by the following gif:
![enter image description here](https://i.stack.imgur.com/I7DBr.gif)
In this example we were using a single neuron in our feature extraction layer, and we generated $9$ convolved features. If we had a larger number of units in the hidden layer, it would be clear why the sub-sampling step after this is required.
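To make the count of $9$ concrete, here is a minimal sketch in plain Python of sliding one $3 \times 3$ feature over a $5 \times 5$ image. The image and kernel values are illustrative assumptions, not necessarily those shown in the gif. The bias and non-linearity a real neuron would apply are left out.

```python
def convolve_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely inside the image."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            # Weighted sum of the patch under the kernel (same weights at every location).
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


# Hypothetical 5x5 binary image and 3x3 learned feature.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

features = convolve_valid(image, kernel)
for row in features:
    print(row)
```

The result is a $3 \times 3$ grid, i.e. $(5 - 3 + 1)^2 = 9$ convolved features for one neuron. With many neurons in the layer the number of outputs grows proportionally, which is where sub-sampling helps.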
The subsequent convolution and sub-sampling steps are based on the same principle, but are computed over the features extracted by the previous layer instead of the raw pixels of the original image.
As mentioned in the paper, we can use the pre-trained weights to initialize the CNN layers. Although that essentially doesn't add anything to the CNN, it normally helps set a good starting point for training (especially when there is an insufficient amount of labeled data).
Because of the CNN's local connectivity, if the topology of the data is lost after dimensionality reduction, CNNs are no longer appropriate.
For example, suppose our data are images. If we treat each pixel as a dimension and use PCA for dimensionality reduction, the new representation of an image is a vector that no longer preserves the original 2D topology (or the correlation between adjacent pixels). In this case it cannot be used directly with 2D CNNs (there are ways to recover the topology, though).
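As a small illustration of why the 2D structure disappears, the sketch below shows only the flattening step that precedes PCA, not PCA itself. The image width of 28 and the pixel coordinates are hypothetical, and row-major flattening is assumed.

```python
def flat_index(row, col, width):
    """Position of pixel (row, col) after row-major flattening of the image."""
    return row * width + col


width = 28  # hypothetical image width

here = flat_index(10, 10, width)
right = flat_index(10, 11, width)   # horizontal neighbour
below = flat_index(11, 10, width)   # vertical neighbour

gap_right = right - here
gap_below = below - here

print("horizontal neighbour is", gap_right, "position away in the vector")
print("vertical neighbour is", gap_below, "positions away in the vector")
```

Two pixels that touch vertically in the image end up a full row apart in the flat vector. PCA then replaces the vector with combinations of all pixel positions, so the reduced coordinates have no grid left for a local 2D filter to slide over.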
Using the AutoEncoder output should work well with CNNs, as it can be seen as adding an additional layer (with fixed parameters) between the CNN and the input.
I happened to do a related project at college, where I tried to label each part of an image as road, sky, or other. Although the results were far from satisfactory, they might give some idea of how those pre-processing techniques affect performance.
The CNNs are trained using SGD with fixed learning rates. Tested on the KITTI road category data set, the error rate of method (2) is around 14%, and the error rates of method (3) and (4) are around 12%.
Please correct me where I'm wrong. :)