Solved – Is it really necessary to flip the kernel in a conv-net

backpropagation, conv-neural-network, machine-learning

If I have a software-implemented convolutional network whose weights are trained via backpropagation, is it really necessary to "flip" the kernel? After all, I am applying the "unflipped" weights (e.g. in a 3 × 3 filter) in the same order going both forwards and backwards.

Or have I misunderstood this? Look at the demo at http://cs231n.github.io/convolutional-networks/ – going forward, the filters appear to be applied unflipped, yet the author says (unfortunately without any further discussion): "The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters)."

Yet if we were strictly following the textbook definition of convolution, should they not be flipped on the forward pass as well?

My thinking is that, so long as I account for this properly in the "full" convolution, a flip is redundant.

But I dare say I am wrong!

Best Answer

Consider the 1D case: suppose we convolve the input L = [a b c d e] with the filter [x y] (applied unflipped, as in the question), resulting in the output Q = [ax+by bx+cy cx+dy dx+ey ex+?]. (We ignore the last entry to avoid thinking about padding.) Now consider $\frac{dQ}{dc}$. It is [0 y x 0 0] -- the kernel is flipped.
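To see this concretely, here is a minimal NumPy sketch (the specific numbers standing in for a–e, x and y are arbitrary placeholders chosen for illustration) that estimates $\frac{dQ}{dc}$ by finite differences and recovers the reversed filter:

```python
import numpy as np

L = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # a, b, c, d, e (arbitrary values)
w = np.array([0.7, -0.3])                  # filter [x, y] (arbitrary values)

def forward(L, w):
    """Unflipped application of the filter: Q[k] = x*L[k] + y*L[k+1] (last padded entry dropped)."""
    return np.array([w[0] * L[k] + w[1] * L[k + 1] for k in range(len(L) - 1)])

# Finite-difference estimate of dQ/dc, where c = L[2]
eps = 1e-6
Lp, Lm = L.copy(), L.copy()
Lp[2] += eps
Lm[2] -= eps
dQ_dc = (forward(Lp, w) - forward(Lm, w)) / (2 * eps)

print(np.round(dQ_dc, 6))   # -> [0, -0.3, 0.7, 0], i.e. [0, y, x, 0]: the filter appears reversed
```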

If the error is E(Q), then consider $\frac{dE}{dc} = \frac{dE}{dQ}^T \frac{dQ}{dc}$. If we write $\frac{dE}{dQ}$ as the vector [f g h i j], then $\frac{dE}{dc} = gy + hx$. Extending this a bit, it is not hard to see that $\frac{dE}{db} = fy + gx$ and $\frac{dE}{dd} = hy + ix$. In general, $\frac{dE}{dL}$ is the convolution of $\frac{dE}{dQ}$ with the reversed filter [y x].
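Checking this general claim numerically with a self-contained sketch (again with arbitrary example numbers): the chain-rule gradient $\frac{dE}{dL}$ matches a "full" convolution of $\frac{dE}{dQ}$ with the reversed filter [y x].

```python
import numpy as np

w = np.array([0.7, -0.3])                  # filter [x, y] (arbitrary values)
dE_dQ = np.array([0.5, -1.0, 2.0, 0.25])   # upstream gradient [f, g, h, i] (arbitrary values)

# Chain rule directly: Q[k] = x*L[k] + y*L[k+1], so each dE/dQ[k] feeds back into L[k] and L[k+1]
dE_dL = np.zeros(5)
for k in range(len(dE_dQ)):
    dE_dL[k]     += dE_dQ[k] * w[0]   # contribution through the x tap
    dE_dL[k + 1] += dE_dQ[k] * w[1]   # contribution through the y tap

# np.convolve flips its second argument internally, so convolving with w = [x, y] here
# is the same as sliding the reversed filter [y, x] unflipped over dE/dQ ('full' zero padding).
dE_dL_conv = np.convolve(dE_dQ, w, mode='full')

print(np.allclose(dE_dL, dE_dL_conv))      # True
```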

Extending this to two dimensions means that instead of just reversing the kernel, you flip it along both axes (equivalently, rotate it 180°), since the kernel is now a two-dimensional matrix.
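In code terms, "flipping both ways" is just a 180° rotation of the kernel matrix; a tiny NumPy illustration (kernel values are arbitrary):

```python
import numpy as np

K = np.arange(9, dtype=float).reshape(3, 3)   # arbitrary 3x3 kernel

# Flipping along both axes is the same as rotating the kernel by 180 degrees
K_flipped = K[::-1, ::-1]
assert np.array_equal(K_flipped, np.rot90(K, 2))
assert np.array_equal(K_flipped, np.flip(np.flip(K, axis=0), axis=1))
```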

Formally, let the input be the matrix $I$, let the kernel be the $n \times n$ matrix $K$, and let the output be the matrix $O$ with error $E(O)$, so that we have the matrix $\frac{dE}{dO}$. Then we can write the following:

$$ O_{xy} = \sum_{u,v = 0,0}^{n-1,n-1} I_{x+u, y+v} \cdot K_{u,v} $$
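This forward formula translates directly into a loop. A sketch assuming stride 1 and no padding, so $O$ is smaller than $I$ by $n-1$ in each dimension (`forward` is just a name chosen here):

```python
import numpy as np

def forward(I, K):
    """O[x, y] = sum over u, v of I[x+u, y+v] * K[u, v] (unflipped application of K)."""
    n = K.shape[0]
    H, W = I.shape
    O = np.zeros((H - n + 1, W - n + 1))
    for x in range(O.shape[0]):
        for y in range(O.shape[1]):
            O[x, y] = np.sum(I[x:x + n, y:y + n] * K)
    return O
```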

$$ \frac{dE}{dI_{ij}} = \sum_{x,y} \frac{dE}{dO_{x,y}} \frac{dO_{x,y}}{dI_{ij}} $$

Now we can substitute in:

$$ \frac{dE}{dI_{ij}} = \sum_{x,y} \frac{dE}{dO_{x,y}} \frac{d\left( \sum_{u,v = 0,0}^{n-1,n-1} I_{x+u, y+v} \cdot K_{u,v} \right)}{dI_{ij}} $$

Pulling the kernel sum outside the derivative:

$$ \frac{dE}{dI_{ij}} = \sum_{x,y} \frac{dE}{dO_{x,y}} \sum_{u,v = 0,0}^{n-1,n-1} K_{u,v} \frac{ dI_{x+u, y+v} }{dI_{ij}} $$

$$ \frac{dE}{dI_{ij}} = \sum_{x,y} \frac{dE}{dO_{x,y}} \sum_{u,v = 0,0}^{n-1,n-1} K_{u,v} \cdot \mathbf{1}(x+u = i, y+v = j) $$

The indicator term is only ever nonzero when $x+u=i$ for some $u$ from $0$ to $n-1$. Since $-u$ ranges from $1-n$ to $0$, this means $x = i-u$ ranges from $1+i-n$ to $i$. The analogous statement holds for $y$, so

$$ \frac{dE}{dI_{ij}} = \sum_{x,y = 1+i-n, 1+j-n}^{i,j} \frac{dE}{dO_{x,y}} K_{i-x,j-y} $$

Now set $x' = x-i+n-1$ and $y' = y-j+n-1$ to get

$$ \frac{dE}{dI_{ij}} = \sum_{x',y' = 0,0}^{n-1,n-1} \frac{dE}{dO_{x'+i-n+1,\,y'+j-n+1}} K_{n-1-x',\,n-1-y'} $$

Notice that the subscripts on the kernel $K$ decrease as $x'$ and $y'$ increase: the backward pass is a convolution of $\frac{dE}{dO}$ with the spatially-flipped (180°-rotated) kernel.
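As a final sanity check, here is a self-contained sketch (random shapes and values; the `forward` loop from above is repeated so the snippet runs on its own) comparing the gradient obtained directly from the chain rule with the derived formula, i.e. applying the 180°-flipped kernel to a zero-padded $\frac{dE}{dO}$:

```python
import numpy as np

def forward(I, K):
    """O[x, y] = sum over u, v of I[x+u, y+v] * K[u, v] (no padding, stride 1)."""
    n = K.shape[0]
    O = np.zeros((I.shape[0] - n + 1, I.shape[1] - n + 1))
    for x in range(O.shape[0]):
        for y in range(O.shape[1]):
            O[x, y] = np.sum(I[x:x + n, y:y + n] * K)
    return O

rng = np.random.default_rng(0)
I = rng.standard_normal((6, 6))        # arbitrary input
K = rng.standard_normal((3, 3))        # arbitrary kernel
dE_dO = rng.standard_normal((4, 4))    # arbitrary upstream gradient, same shape as O
n = K.shape[0]

# 1) Chain rule directly: scatter each dE/dO[x, y] back through the receptive field of O[x, y]
dE_dI = np.zeros_like(I)
for x in range(dE_dO.shape[0]):
    for y in range(dE_dO.shape[1]):
        dE_dI[x:x + n, y:y + n] += dE_dO[x, y] * K

# 2) The derived formula: apply the 180-degree-flipped kernel to dE/dO zero-padded by n-1
padded = np.pad(dE_dO, n - 1)
K_flip = K[::-1, ::-1]
dE_dI_formula = forward(padded, K_flip)

print(np.allclose(dE_dI, dE_dI_formula))   # True
```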