What is the difference between the terms "kernel" and "filter" in the context of convolutional neural networks?

# Solved – Difference between “kernel” and “filter” in CNN

conv-neural-networkdeep learningneural networksterminology

#### Related Solutions

**Autoencoder** is a simple 3-layer neural network where *output units* are directly connected *back to input units*. E.g. in a network like this:

`output[i]`

has edge back to `input[i]`

for every `i`

. Typically, number of hidden units is much less then number of visible (input/output) ones. As a result, when you pass data through such a network, it first compresses (encodes) input vector to "fit" in a smaller representation, and then tries to reconstruct (decode) it back. The task of training is to minimize an error or reconstruction, i.e. find the most efficient compact representation (encoding) for input data.

**RBM** shares similar idea, but uses stochastic approach. Instead of deterministic (e.g. logistic or ReLU) it uses stochastic units with particular (usually binary of Gaussian) distribution. Learning procedure consists of several steps of Gibbs sampling (propagate: sample hiddens given visibles; reconstruct: sample visibles given hiddens; repeat) and adjusting the weights to minimize reconstruction error.

Intuition behind RBMs is that there are some visible random variables (e.g. film reviews from different users) and some hidden variables (like film genres or other internal features), and the task of training is to find out how these two sets of variables are actually connected to each other (more on this example may be found here).

**Convolutional Neural Networks** are somewhat similar to these two, but instead of learning single global weight matrix between two layers, they aim to find a set of locally connected neurons. CNNs are mostly used in image recognition. Their name comes from "convolution" operator or simply "filter". In short, filters are an easy way to perform complex operation by means of simple change of a convolution kernel. Apply Gaussian blur kernel and you'll get it smoothed. Apply Canny kernel and you'll see all edges. Apply Gabor kernel to get gradient features.

(image from here)

The goal of convolutional neural networks is not to use one of predefined kernels, but instead to *learn data-specific kernels*. The idea is the same as with autoencoders or RBMs - translate many low-level features (e.g. user reviews or image pixels) to the compressed high-level representation (e.g. film genres or edges) - but now weights are learned only from neurons that are spatially close to each other.

All three models have their use cases, pros and cons, but probably the most important properties are:

- Autoencoders are simplest ones. They are intuitively understandable, easy to implement and to reason about (e.g. it's much easier to find good meta-parameters for them than for RBMs).
- RBMs are generative. That is, unlike autoencoders that only discriminate some data vectors in favour of others, RBMs can also generate new data with given joined distribution. They are also considered more feature-rich and flexible.
- CNNs are very specific model that is mostly used for very specific task (though pretty popular task). Most of the top-level algorithms in image recognition are somehow based on CNNs today, but outside that niche they are hardly applicable (e.g. what's the reason to use convolution for film review analysis?).

**UPD.**

**Dimensionality reduction**

When we represent some object as a vector of $n$ elements, we say that this is a vector in $n$-dimensional space. Thus, **dimensionality reduction** refers to a process of refining data in such a way, that each data vector $x$ is translated into another vector $x'$ in an $m$-dimensional space (vector with $m$ elements), where $m < n$. Probably the most common way of doing this is PCA. Roughly speaking, PCA finds "internal axes" of a dataset (called "components") and sorts them by their importance. First $m$ most important components are then used as new basis. Each of these components may be thought of as a high-level feature, describing data vectors better than original axes.

Both - autoencoders and RBMs - do the same thing. Taking a vector in $n$-dimensional space they translate it into an $m$-dimensional one, trying to keep as much important information as possible and, at the same time, remove noise. If training of autoencoder/RBM was successful, each element of resulting vector (i.e. each hidden unit) represents something important about the object - shape of an eyebrow in an image, genre of a film, field of study in scientific article, etc. You take lots of noisy data as an input and produce much less data in a much more efficient representation.

**Deep architectures**

So, if we already had PCA, why the hell did we come up with autoencoders and RBMs? It turns out that PCA only allows **linear transformation** of a data vectors. That is, having $m$ principal components $c_1..c_m$, you can represent only vectors $x=\sum_{i=1}^{m}w_ic_i$. This is pretty good already, but not always enough. No matter, how many times you will apply PCA to a data - relationship will always stay linear.

Autoencoders and RBMs, on other hand, are non-linear by the nature, and thus, they can learn more complicated relations between visible and hidden units. Moreover, they can be **stacked**, which makes them even more powerful. E.g. you train RBM with $n$ visible and $m$ hidden units, then you put another RBM with $m$ visible and $k$ hidden units on top of the first one and train it too, etc. And exactly the same way with autoencoders.

But you don't just add new layers. On each layer you try to learn best possible representation for a data from the previous one:

On the image above there's an example of such a deep network. We start with ordinary pixels, proceed with simple filters, then with face elements and finally end up with entire faces! This is the essence of **deep learning**.

Now note, that at this example we worked with image data and sequentially took
larger and larger areas of spatially close pixels. Doesn't it sound similar? Yes, because it's an example of deep *convolutional* network. Be it based on autoencoders or RBMs, it uses convolution to stress importance of locality. That's why CNNs are somewhat distinct from autoencoders and RBMs.

**Classification**

None of models mentioned here work as classification algorithms per se. Instead, they are used for **pretraining** - learning transformations from low-level and hard-to-consume representation (like pixels) to a high-level one. Once deep (or maybe not that deep) network is pretrained, input vectors are transformed to a better representation and resulting vectors are finally passed to real classifier (such as SVM or logistic regression). In an image above it means that at the very bottom there's one more component that actually does classification.

Many recent effective CNN structures use small filters that preserve the spatial resolution, for example the VGG network and the 100-layer residual network.

I think most importantly having the same input and output size allows for simply stacking up more layers without affecting(decreasing) the spatial resolution, so that we can build deeper networks.

Moreover, with such spacial consistency we can, add some operation between the input and output of a set of layers, as in the residual network,

Formally, denoting the desired underlying mapping as $H(x)$, we let the stacked nonlinear layers fit another mapping of $F(x) := H(x)−x$.

or concatenate the output from filters of different sizes, as in the inception network.

###### Related Question

- Solved – How to initialize the elements of the filter matrix
- Solved – the difference between stride and subsample in convolutional neural networks
- CNN – How to Work with Multiple Filter Region Sizes: 2, 3, and 4
- Solved – difference between ‘transfer learning’ and ‘domain adaptation’
- Solved – the difference between ‘same’ and ‘half’ border mode in Convolutional neural networks
- Solved – Connection between filters and feature map in CNN

## Best Answer

In the context of convolutional neural networks, kernel = filter = feature detector.

Here is a great illustration from Stanford's deep learning tutorial (also nicely explained by Denny Britz).

The filter is the yellow sliding window, and its value is:

\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}