Solved – Is it really necessary to flip the kernel in a conv-net

backpropagation, conv-neural-network, machine-learning

If I have a software-implemented convolutional network whose weights are trained via backpropagation, is it really necessary to "flip" the kernel? After all, I am applying the "unflipped" weights (e.g. in a 3 × 3 filter) in the same order going both forwards and backwards.

Or have I misunderstood this? Look at the demo at http://cs231n.github.io/convolutional-networks/ – going forward, the filters appear to be applied unflipped, yet the author says (unfortunately without any further discussion): "The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters)."

Yet if we were strictly following the textbook definition of convolution, should they not be flipped on the forward pass as well?

My thinking is that, so long as I account for this properly in the "full" convolution, a flip is redundant.

But I dare say I am wrong!

Best Answer

Consider the 1D case: suppose we convolve the input L = [a b c d e] with the filter [x y] (applied unflipped, as in the question), resulting in the output Q = [ax+by bx+cy cx+dy dx+ey ex+?]. (We ignore the last entry to avoid thinking about padding.) Now consider $\frac{dQ}{dc}$. It is [0 y x 0 0] -- the kernel is flipped.
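To see this concretely, here is a minimal NumPy sketch (the specific numbers standing in for a–e, x and y are arbitrary placeholders chosen for illustration) that estimates $\frac{dQ}{dc}$ by finite differences and recovers the reversed filter:

```python
import numpy as np

L = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # a, b, c, d, e (arbitrary values)
w = np.array([0.7, -0.3])                  # filter [x, y] (arbitrary values)

def forward(L, w):
    """Unflipped application of the filter: Q[k] = x*L[k] + y*L[k+1] (last padded entry dropped)."""
    return np.array([w[0] * L[k] + w[1] * L[k + 1] for k in range(len(L) - 1)])

# Finite-difference estimate of dQ/dc, where c = L[2]
eps = 1e-6
Lp, Lm = L.copy(), L.copy()
Lp[2] += eps
Lm[2] -= eps
dQ_dc = (forward(Lp, w) - forward(Lm, w)) / (2 * eps)

print(np.round(dQ_dc, 6))   # -> [0, -0.3, 0.7, 0], i.e. [0, y, x, 0]: the filter appears reversed
```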

If the error is E(Q), then consider $\frac{dE}{dc} = \frac{dE}{dQ}^T \frac{dQ}{dc}$. If we write $\frac{dE}{dQ}$ as the vector [f g h i j], then $\frac{dE}{dc} = gy + hx$. Extending this a bit, it is not hard to see that $\frac{dE}{db} = fy + gx$ and $\frac{dE}{dd} = hy + ix$. In general, $\frac{dE}{dL}$ is the convolution of $\frac{dE}{dQ}$ with the reversed filter [y x].
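Checking this general claim numerically with a self-contained sketch (again with arbitrary example numbers): the chain-rule gradient $\frac{dE}{dL}$ matches a "full" convolution of $\frac{dE}{dQ}$ with the reversed filter [y x].

```python
import numpy as np

w = np.array([0.7, -0.3])                  # filter [x, y] (arbitrary values)
dE_dQ = np.array([0.5, -1.0, 2.0, 0.25])   # upstream gradient [f, g, h, i] (arbitrary values)

# Chain rule directly: Q[k] = x*L[k] + y*L[k+1], so each dE/dQ[k] feeds back into L[k] and L[k+1]
dE_dL = np.zeros(5)
for k in range(len(dE_dQ)):
    dE_dL[k]     += dE_dQ[k] * w[0]   # contribution through the x tap
    dE_dL[k + 1] += dE_dQ[k] * w[1]   # contribution through the y tap

# np.convolve flips its second argument internally, so convolving with w = [x, y] here
# is the same as sliding the reversed filter [y, x] unflipped over dE/dQ ('full' zero padding).
dE_dL_conv = np.convolve(dE_dQ, w, mode='full')

print(np.allclose(dE_dL, dE_dL_conv))      # True
```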

Extending this to two dimensions means that instead of just reversing the kernel, you flip it along both axes (equivalently, rotate it 180°), since the kernel is now a two-dimensional matrix.
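In code terms, "flipping both ways" is just a 180° rotation of the kernel matrix; a tiny NumPy illustration (kernel values are arbitrary):

```python
import numpy as np

K = np.arange(9, dtype=float).reshape(3, 3)   # arbitrary 3x3 kernel

# Flipping along both axes is the same as rotating the kernel by 180 degrees
K_flipped = K[::-1, ::-1]
assert np.array_equal(K_flipped, np.rot90(K, 2))
assert np.array_equal(K_flipped, np.flip(np.flip(K, axis=0), axis=1))
```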

Formally, let the input be the matrix $I$, let the kernel be the $n \times n$ matrix $K$, and let the output be the matrix $O$ with error $E(O)$, so that we have the matrix $\frac{dE}{dO}$. Then we can write the following:

$$ O_{xy} = \sum_{u,v = 0,0}^{n-1,n-1} I_{x+u, y+v} \cdot K_{u,v} $$
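This forward formula translates directly into a loop. A sketch assuming stride 1 and no padding, so $O$ is smaller than $I$ by $n-1$ in each dimension (`forward` is just a name chosen here):

```python
import numpy as np

def forward(I, K):
    """O[x, y] = sum over u, v of I[x+u, y+v] * K[u, v] (unflipped application of K)."""
    n = K.shape[0]
    H, W = I.shape
    O = np.zeros((H - n + 1, W - n + 1))
    for x in range(O.shape[0]):
        for y in range(O.shape[1]):
            O[x, y] = np.sum(I[x:x + n, y:y + n] * K)
    return O
```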

$$ \frac{dE}{dI_{ij}} = \sum_{x,y} \frac{dE}{dO_{x,y}} \frac{dO_{x,y}}{dI_{ij}} $$

Now we can substitute in:

$$ \frac{dE}{dI_{ij}} = \sum_{x,y} \frac{dE}{dO_{x,y}} \frac{d\left( \sum_{u,v = 0,0}^{n-1,n-1} I_{x+u, y+v} \cdot K_{u,v} \right)}{dI_{ij}} $$

Pulling the kernel sum outside the derivative:

$$ \frac{dE}{dI_{ij}} = \sum_{x,y} \frac{dE}{dO_{x,y}} \sum_{u,v = 0,0}^{n-1,n-1} K_{u,v} \frac{ dI_{x+u, y+v} }{dI_{ij}} $$

$$ \frac{dE}{dI_{ij}} = \sum_{x,y} \frac{dE}{dO_{x,y}} \sum_{u,v = 0,0}^{n-1,n-1} K_{u,v} \cdot \mathbf{1}(x+u = i, y+v = j) $$

The indicator term is only ever nonzero when $x+u=i$ for some $u$ from $0$ to $n-1$. Since $-u$ ranges from $1-n$ to $0$, this means $x = i-u$ ranges from $1+i-n$ to $i$. The analogous statement holds for $y$, so

$$ \frac{dE}{dI_{ij}} = \sum_{x,y = 1+i-n, 1+j-n}^{i,j} \frac{dE}{dO_{x,y}} K_{i-x,j-y} $$

Now set $x' = x-i+n-1$ and $y' = y-j+n-1$ to get

$$ \frac{dE}{dI_{ij}} = \sum_{x',y' = 0,0}^{n-1,n-1} \frac{dE}{dO_{x'+i-n+1,\,y'+j-n+1}} K_{n-1-x',\,n-1-y'} $$

Notice that the subscripts on the kernel $K$ decrease as $x'$ and $y'$ increase: the backward pass is a convolution of $\frac{dE}{dO}$ with the spatially-flipped (180°-rotated) kernel.
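As a final sanity check, here is a self-contained sketch (random shapes and values; the `forward` loop from above is repeated so the snippet runs on its own) comparing the gradient obtained directly from the chain rule with the derived formula, i.e. applying the 180°-flipped kernel to a zero-padded $\frac{dE}{dO}$:

```python
import numpy as np

def forward(I, K):
    """O[x, y] = sum over u, v of I[x+u, y+v] * K[u, v] (no padding, stride 1)."""
    n = K.shape[0]
    O = np.zeros((I.shape[0] - n + 1, I.shape[1] - n + 1))
    for x in range(O.shape[0]):
        for y in range(O.shape[1]):
            O[x, y] = np.sum(I[x:x + n, y:y + n] * K)
    return O

rng = np.random.default_rng(0)
I = rng.standard_normal((6, 6))        # arbitrary input
K = rng.standard_normal((3, 3))        # arbitrary kernel
dE_dO = rng.standard_normal((4, 4))    # arbitrary upstream gradient, same shape as O
n = K.shape[0]

# 1) Chain rule directly: scatter each dE/dO[x, y] back through the receptive field of O[x, y]
dE_dI = np.zeros_like(I)
for x in range(dE_dO.shape[0]):
    for y in range(dE_dO.shape[1]):
        dE_dI[x:x + n, y:y + n] += dE_dO[x, y] * K

# 2) The derived formula: apply the 180-degree-flipped kernel to dE/dO zero-padded by n-1
padded = np.pad(dE_dO, n - 1)
K_flip = K[::-1, ::-1]
dE_dI_formula = forward(padded, K_flip)

print(np.allclose(dE_dI, dE_dI_formula))   # True
```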