CNN – How to Work with Multiple Filter Region Sizes: 2, 3, and 4

Tags: classification, conv-neural-network, convolution, deep learning, machine learning

I am studying Yoon Kim's convolutional neural network (CNN) for sentence classification.

I am still confused about the size of the filter and how the convolution works.

What does filter_h = 5 mean alongside filter_hs = [3,4,5]? Is filter_h the maximum height among the filters in filter_hs? How does it work?

To get the image shape, the maximum sentence length in this case is 56, so 56 + 2 * (5-1) = 64. What does the number 2 mean? Where does it come from?

Best Answer

"I am still confused about the size of the filter and how the convolution works."

Here is a great illustration from Stanford's deep learning tutorial (also nicely explained by Denny Britz).

The filter is the yellow sliding window, and its value is:

$$\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$$

Another neat visualization of convolutions (and deconvolutions a.k.a. transposed convolutions): https://github.com/vdumoulin/conv_arithmetic

"To get the image shape, the maximum sentence length in this case is 56, so 56 + 2 * (5-1) = 64. What does the number 2 mean? Where does it come from?"

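The 2 accounts for padding on both the left and the right of the sentence: each side is padded with filter_h - 1 = 5 - 1 = 4 positions, giving 56 + 4 + 4 = 64.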
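The arithmetic can be checked with a minimal sketch. It assumes that filter_h is the largest height in filter_hs, that each filter slides with stride 1, and that no padding is applied beyond the one described here. The helper names `padded_length` and `feature_map_lengths` are hypothetical and are not taken from the original code.

```python
def padded_length(max_len, filter_h):
    """Sentence length after padding (filter_h - 1) positions on each side."""
    pad = filter_h - 1
    return max_len + 2 * pad


def feature_map_lengths(img_h, filter_hs):
    """Number of window positions for each filter height (stride 1, no extra padding)."""
    return {h: img_h - h + 1 for h in filter_hs}


max_len = 56                  # longest sentence, from the question
filter_hs = [3, 4, 5]         # filter heights, from the question
filter_h = max(filter_hs)     # assumption: filter_h is the largest filter height

img_h = padded_length(max_len, filter_h)
lengths = feature_map_lengths(img_h, filter_hs)

print(img_h)    # 64
print(lengths)  # {3: 62, 4: 61, 5: 60}
```

Padding with the largest filter height minus one means that even the tallest filter can cover the first and the last word of the sentence on their own.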

Related Solutions

CNNs – What the Convolution Step Does in a Convolutional Neural Network

I'll first try to share some intuition behind CNNs and then comment on the particular topics you listed.

The convolution and sub-sampling layers in a CNN are not different from the hidden layers in a common MLP, i.e. their function is to extract features from their input. These features are then given to the next hidden layer to extract still more complex features, or are given directly to a standard classifier to output the final prediction (usually a softmax, but an SVM or any other classifier can also be used). In the context of image recognition, these features are image traits, like stroke patterns in the lower layers and object parts in the upper layers.

In natural images these features tend to be the same at all locations. Recognizing a certain stroke pattern in the middle of the image is as useful as recognizing it close to the borders. So why don't we replicate the hidden layer and connect multiple copies of it to all regions of the input image, so the same features can be detected anywhere? That is exactly what a CNN does, but in an efficient way. After the replication (the "convolution" step) we add a sub-sampling step, which can be implemented in many ways, but is nothing more than a sub-sample. In theory this step could even be removed, but in practice it is essential to keep the problem tractable.
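As one possible illustration of the sub-sampling step, here is a minimal NumPy sketch of non-overlapping 2x2 max pooling. Max pooling is only one of the many ways to sub-sample and is chosen here as an assumption; the function name `max_pool_2x2` and the sample values are hypothetical.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block of a 2D feature map."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Drop a trailing row/column if the size is odd, then group into 2x2 blocks.
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))


feature_map = np.arange(16).reshape(4, 4)   # illustrative 4x4 feature map
pooled = max_pool_2x2(feature_map)

print(pooled)
# [[ 5  7]
#  [13 15]]
```

The 4x4 map shrinks to 2x2, so the next layer sees a quarter as many values per feature.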

Thus:

1. Correct.
2. As explained above, the hidden layers of a CNN are feature extractors, as in a regular MLP. The alternating convolution and sub-sampling steps are performed during both training and classification, so they are not something done "before" the actual processing. I wouldn't call them "pre-processing", in the same way that the hidden layers of an MLP are not called that.
3. Correct.

A good image that helps to understand convolution is on the CNN page of the UFLDL tutorial. Think of a hidden layer with a single neuron which is trained to extract features from $3 \times 3$ patches. If we convolve this single learned feature over a $5 \times 5$ image, this process can be represented by the following gif:

[Animation: a $3 \times 3$ filter sliding over a $5 \times 5$ image to produce the convolved feature map]

In this example we were using a single neuron in our feature extraction layer, and we generated $9$ convolved features. If we had a larger number of units in the hidden layer, it would be clear why the sub-sampling step after this is required.
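The count of $9$ convolved features can be reproduced with a small NumPy sketch. The image values below are illustrative assumptions, and the kernel is the symmetric $3 \times 3$ filter with ones on the corners and centre; the function name `conv2d_valid` is hypothetical. As is usual in CNNs, the kernel is not flipped (strictly a cross-correlation), which makes no difference for a symmetric kernel.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely inside the image."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1), dtype=image.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Elementwise product of the patch and the kernel, summed to one feature.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Illustrative 5x5 binary image (assumed values).
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])

kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

features = conv2d_valid(image, kernel)
print(features.shape)  # (3, 3), i.e. 9 convolved features
print(features)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

A $5 \times 5$ input and a $3 \times 3$ filter leave $(5 - 3 + 1)^2 = 9$ valid positions, one feature per position.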

The subsequent convolution and sub-sampling steps are based on the same principle, but are computed over the features extracted in the previous layer instead of the raw pixels of the original image.

CNN Convolutional Operators – How to Determine the Number

I'm assuming that when you say 11x11x10 you mean that you have a layer with ten 11x11 filters. So the number of convolutions you'll be doing in that layer is simply 10: one 2D discrete convolution per filter in your filter bank. So, let's say that you have a network:

480x480x1    # your input image of 1 channel
11x11x10     # your first filter bank of 10, 11x11 filters
5x5x20       # your second filter bank of 20, 5x5 filters
4x4x100      # your final filter bank of 100, 4x4 filters

You're going to be doing $10 + 20 + 100 = 130$ multi-channel 2D convolutions, with depths of 1, 10, and 20 in the three layers respectively. As you can see, the depth of each convolution changes as a function of the depth of the input volume coming from the previous layer.

But I assume that you're trying to figure out how to compare this to single-channel 2D convolutions. Well, you could just multiply the depth of each input volume by the number of filters in that layer and add the results together. In your case: $1 \times 10 + 10 \times 20 + 20 \times 100 = 10 + 200 + 2000 = 2210$.
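Both counts can be reproduced with a short sketch. The `layers` list is a hypothetical encoding of the example network above as (input depth, number of filters) pairs.

```python
# (input_depth, n_filters) for each filter bank in the example network.
layers = [
    (1, 10),     # 11x11 filters applied to the 1-channel input image
    (10, 20),    # 5x5 filters applied to the 10-channel output of layer 1
    (20, 100),   # 4x4 filters applied to the 20-channel output of layer 2
]

# One multi-channel convolution per filter.
multi_channel = sum(n_filters for _, n_filters in layers)

# Each multi-channel convolution is made of one single-channel convolution per input channel.
single_channel = sum(depth * n_filters for depth, n_filters in layers)

print(multi_channel)   # 130
print(single_channel)  # 2210
```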

Now, this only tells you how many single-channel 2D convolutions you're doing, not how computationally intensive each convolution is. The computational intensity of each convolution depends on a variety of parameters, including image_size, image_depth, filter_size, your stride (how far you step between each individual filter calculation), the number of pooling layers you have, etc.
