Solved – the difference between stride and subsample in convolutional neural networks

conv-neural-network, deep-learning, neural-networks, terminology

Is there any difference between stride and subsample in convolutional neural networks?

Best Answer

Remark

Any strided convolutional or subsampling layer achieves downsampling of the input. So, in a way, referring to "subsampling" in the context of a convolutional layer most likely means using a stride. For example, the strides argument of Convolution2D layers was called subsample in Keras 1.2.2 (it has since been renamed to simply strides in Conv2D).
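As a quick illustration (a minimal sketch; the layer hyperparameters are arbitrary, and each snippet assumes the corresponding Keras version is installed):

```python
# Keras 1.2.2: the downsampling step of a Convolution2D layer
# is specified via the "subsample" argument.
from keras.layers import Convolution2D
conv_v1 = Convolution2D(32, 3, 3, subsample=(2, 2))

# Keras 2.x: the same hyperparameter of Conv2D is called "strides".
from keras.layers import Conv2D
conv_v2 = Conv2D(32, kernel_size=(3, 3), strides=(2, 2))
```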


On Subsampling, Mean Pooling, and Convolutional layers

If by "subsample" you mean the use of Subsampling layers, then Subsampling is a generalization of Mean Pooling with learnable weights. It may or may not use striding.

See the output of SpatialSubsampling in Torch:

$$\text{output}[i][j][k] = \text{bias}[k] + \text{weight}[k] \sum_{s=1}^{kW} \sum_{t=1}^{kH} \text{input}[dW\cdot(i-1)+s][dH\cdot(j-1)+t][k]$$

The output has the same number of channels as the input; $k$ indexes the channel.

And here's the output of SpatialAveragePooling:

$$\text{output}[i][j][k] = \frac{1}{kW\cdot kH} \sum_{s=1}^{kW} \sum_{t=1}^{kH} \text{input}[dW\cdot(i-1)+s][dH\cdot(j-1)+t][k]$$

The only difference between the two is that Subsampling applies a learnable weight and bias per channel, whereas Average Pooling uses the fixed factor $1/(kW\cdot kH)$. In both cases, each input channel maps only to the same channel in the output.
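Here is a minimal NumPy sketch of the two formulas above (the channels-last layout, 0-based indexing, and the function names are my own, purely for illustration); the assert at the end checks that average pooling is just Subsampling with weight $= 1/(kW\cdot kH)$ and bias $= 0$:

```python
import numpy as np

def spatial_subsampling(x, weight, bias, kW, kH, dW, dH):
    """Per-channel 'subsampling' layer: one scalar weight and one scalar
    bias per channel, applied to the sum over each kW x kH window.
    x: input of shape (W, H, C); weight, bias: shape (C,)."""
    W, H, C = x.shape
    oW, oH = (W - kW) // dW + 1, (H - kH) // dH + 1
    out = np.empty((oW, oH, C))
    for i in range(oW):
        for j in range(oH):
            window = x[i * dW:i * dW + kW, j * dH:j * dH + kH, :]
            # Each channel stays separate: elementwise weight and bias.
            out[i, j, :] = bias + weight * window.sum(axis=(0, 1))
    return out

def spatial_average_pooling(x, kW, kH, dW, dH):
    """Mean pooling: the same window sum, scaled by 1/(kW*kH),
    with no learnable parameters."""
    C = x.shape[2]
    return spatial_subsampling(
        x, np.full(C, 1.0 / (kW * kH)), np.zeros(C), kW, kH, dW, dH)

# Average pooling is the special case weight = 1/(kW*kH), bias = 0:
x = np.random.randn(8, 8, 3)
a = spatial_average_pooling(x, 2, 2, 2, 2)
b = spatial_subsampling(x, np.full(3, 0.25), np.zeros(3), 2, 2, 2, 2)
assert np.allclose(a, b)
```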

Compare this with the output of SpatialConvolution, where every channel in the input contributes to every channel in the output:

$$\text{output}[i][j][k] = \text{bias}[k] + \sum_l \sum_{s=1}^{kW} \sum_{t=1}^{kH} \text{weight}[s][t][l][k] \cdot \text{input}[dW\cdot(i-1)+s][dH\cdot(j-1)+t][l]$$

So, in short: Subsampling is not generally equivalent to strided Convolution, because their channel mappings differ. Subsampling keeps each channel separate, while Convolution mixes them.
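For completeness, a matching sketch of the SpatialConvolution formula (same assumed channels-last layout as above); note the extra sum over the input channels $l$, which the Subsampling sketch does not have:

```python
import numpy as np

def spatial_convolution(x, weight, bias, dW, dH):
    """Full convolution: every input channel l contributes to every
    output channel k.
    x: (W, H, Cin); weight: (kW, kH, Cin, Cout); bias: (Cout,)."""
    kW, kH, Cin, Cout = weight.shape
    W, H, _ = x.shape
    oW, oH = (W - kW) // dW + 1, (H - kH) // dH + 1
    out = np.empty((oW, oH, Cout))
    for i in range(oW):
        for j in range(oH):
            window = x[i * dW:i * dW + kW, j * dH:j * dH + kH, :]
            # Contract over s, t, and the input channel l for each k.
            out[i, j, :] = bias + np.tensordot(window, weight, axes=3)
    return out
```

If you zero out all cross-channel weights and make each kernel constant over $s$ and $t$, this reduces to the Subsampling formula, which is exactly the sense in which Subsampling is the more restricted mapping.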