Solved – RGB images as input to CNN

conv-neural-networkconvolutionmachine learning

Considering a 32*32*3 RGB image, would there be filters/kernels for each color channel? I haven't found examples explaining how CNN works for RGB images and whether each filter is a 3D. If I decide to have 4 filters where each filter is of size 5*5, then would each of the 4 filters be of size 5*5*3?

http://cs231n.github.io/convolutional-networks/ Tutorial is a good starting point, but I don't understand how the third layer w1 is computed and how come the last output layer is of volume 2?
image

Can somebody please help with the math? This is what I could understand:
for the feature map (pink) w1[:,:,0] the element 0 in top left corner is obtained by convolution of the first blue and red filters as 0*0+0*(-1)+0*(-1)+0*0+2*0+2*(-1)+0*0+0*1+2*1 = -2+2=0

Continuing this way, all the element if w[:,:,0] are obtained. Now convolving with w0[:,:,0] and w1[:,:,0] to get 0*0+(-1)*1+(-1)*1+0*1+0*0+(-1)*0+0*(-1)+1*(-1)+1*(-1) = -1-1-1-1 = -4 which is different than the element inside the green square on top left corner of o[:,:,0]

Then how is the first green square, o[:,:,0] obtained and why it is of depth 2 is unclear to me. I am getting -4 instead of 3 as the first element of o[:,:,0]

Best Answer

See the output of a Convolutional layer, from Torch documentation where all channels in the input contribute to all channels in the output:

$$\text{O}[i][j][k] = \text{bias}[k] + \sum_l \sum_{s=1}^{kW} \sum_{t=1}^{kH} \text{weight}[s][t][l][k] * \text{I}[dW\cdot(i-1)+s)][dH\cdot(j-1)+t][l]$$

the output image size will be nOutputPlane x oheight x owidth [...]

In you example, $k\in[1,2]$, $l\in[1,3]$. That explains the dimensionality of the weight tensor.

The reason is simple, you want to go from 3 channels (planes in Torch nomenclature) to 2 channels. As each output channel depends on all input channels, then you need three separate filters acting on each input channel for each output channel, totaling 6 filters in total.

The values in the output are given by the convolution (the sum of the elements of the Hadamard product matrices) between input channels and convolutional filters.

For example:

$$\begin{pmatrix} 0 & 0 & 0\\ 0 & 2 & 2\\ 0 & 0 & 2 \end{pmatrix} * \begin{pmatrix} 0 & -1 & -1\\ 0 & 0 & -1\\ 0 & 1 & 1 \end{pmatrix} = \text{sum}\begin{pmatrix} 0 & 0 & 0\\ 0 & 0 & -2\\ 0 & 0 & 2 \end{pmatrix}=0 \\ \begin{pmatrix} 0 & 0 & 0\\ 0 & 1 & 1\\ 0 & 2 & 2 \end{pmatrix} * \begin{pmatrix} 0 & 0 & 1\\ 0 & -1 & 0\\ -1 & 0 & 1 \end{pmatrix} = \text{sum}\begin{pmatrix} 0 & 0 & 0\\ 0 & -1 & 0\\ 0 & 0 & 2 \end{pmatrix}=1 \\ \begin{pmatrix} 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1 \end{pmatrix} * \begin{pmatrix} -1 & 0 & 0\\ -1 & 1 & 1\\ 0 & 0 & 0 \end{pmatrix} = \text{sum}\begin{pmatrix} 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix}=1$$

So, using our equation to find the first value in $\text{O}[1][1][1]$

$$\text{O}[1][1][1] = \text{bias}[1] + \sum_l \sum_{s=1}^{kW} \sum_{t=1}^{kH} \text{weight}[s][t][l][1] * \text{I}[s][t][l]=1+0+1+1 = 3$$

So we recover the result from the example.


Continuing...

Can somebody please help with the math? This is what I could understand: for the feature map (pink) w1[:,:,0] the element 0 in top left corner is obtained by convolution of the first blue and red filters as 0*0+0*(-1)+0*(-1)+0*0+2*0+2*(-1)+0*0+0*1+2*1 = -2+2=0

No, you misunderstood. The weights (red) are used to obtain the output (green), you do not convolve a weight with another.

Continuing this way, all the element if w[:,:,0] are obtained. Now convolving with w0[:,:,0] and w1[:,:,0] to get 0*0+(-1)*1+(-1)*1+0*1+0*0+(-1)*0+0*(-1)+1*(-1)+1*(-1) = -1-1-1-1 = -4 which is different than the element inside the green square on top left corner of o[:,:,0]

The second column of weights are also used on the input. They give you the second channel of the output.

Then how is the first green square, o[:,:,0] obtained and why it is of depth 2 is unclear to me. I am getting -4 instead of 3 as the first element of o[:,:,0]

You can see I sorted this already in the example above. You only use the first column of weights to obtain the first channel of the output.

Related Question