Machine Learning – How Do CNN Filters Learn from Back-Propagation?

conv-neural-network, machine-learning, neural-networks

I have some intermediate knowledge of image classification using convolutional neural networks. I'm familiar with concepts like 'gradient descent', 'derivatives', 'back-propagation', and the 'weight update process'.

I know that filters are randomly initialized and learned during training. But I really can't digest how the derivative with respect to a filter is taken.

For example, in a simple dense layer, increasing or decreasing a single weight results in a change to the cost function. But for filters, how can you decide whether increasing or decreasing a filter value would give a better or worse result? There seems to be no mathematical relationship like there is for weights and biases.

Filters just transform the image. One way to solve this would be to change a single filter value by a small amount and run the whole image dataset to check the cost function. But that gives the gradient of only a single value of a single filter.

I'm pretty sure this is not the way it's done.

I have researched this question for more than a month, and everywhere I look it simply states that 'filters are learned from back-propagation', but no one explains the maths behind it. There are plenty of videos that explain back-propagation for dense layers, but I have found none that explains how filters are learned.

Best Answer

Because there is no mathematical relationship like in weight and biases.

Of course there is a mathematical relationship that binds the filter coefficients to the loss function. The relationship is defined by a slightly modified version of the convolution operator, but it can still be written in simple terms.

Let's say you have a 3x3 image, $I$, and a 2x2 filter, $W$. Sliding this filter over the image produces a 2x2 output (no padding). The four elements of this output would be

$$ \begin{align}o_{11}&=I_{11}W_{11}+I_{12}W_{12}\\&+I_{21}W_{21}+I_{22}W_{22}\\\\ o_{12}&=I_{12}W_{11}+I_{13}W_{12}\\&+I_{22}W_{21}+I_{23}W_{22}\\\\ o_{21}&=I_{21}W_{11}+I_{22}W_{12}\\&+I_{31}W_{21}+I_{32}W_{22}\\\\ o_{22}&=I_{22}W_{11}+I_{23}W_{12}\\&+I_{32}W_{21}+I_{33}W_{22}\\\\ \end{align} $$

$o_{ij}$ represents the element at the $i$-th row and $j$-th column (and similarly for the input and the filter).
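If it helps to see those four dot products in code, here is a minimal numpy sketch of the same 2x2 valid cross-correlation (the helper name `conv2d_valid` and the toy arrays below are purely illustrative, not from any particular library):

```python
import numpy as np

def conv2d_valid(I, W):
    # Slide W over I with stride 1 and no padding (cross-correlation).
    h = I.shape[0] - W.shape[0] + 1
    w = I.shape[1] - W.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(I[i:i + W.shape[0], j:j + W.shape[1]] * W)
    return out

I = np.arange(1.0, 10.0).reshape(3, 3)   # a toy 3x3 image
W = np.random.randn(2, 2)                # a randomly initialized 2x2 filter
o = conv2d_valid(I, W)                   # 2x2 output: o[0, 0] is o_11, etc.
```

Each entry of `o` is exactly one of the four sums written above.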

Sometimes the output is flattened and fed forward into a dense layer. In this case, it'll be a vector, something like $[o_{11} \ o_{12}\ o_{21}\ o_{22}]$, that goes into the further layers.

Or, the next layer can be a pooling layer, and its output can then be fed into a dense layer, after flattening if necessary. For example, with average pooling and a 2x2 pool size, we'd have a single output: $$o=\frac{o_{11}+o_{12}+o_{21}+o_{22}}{4}$$ which is sometimes called global pooling. Either way, the convolution output is in a form that reaches the final layer through precisely defined mathematical relationships.
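Continuing the sketch above, global average pooling followed by a toy squared-error loss makes the whole path from $W$ to the loss explicit (the scalar `target` is made up purely for illustration):

```python
target = 1.0                  # an arbitrary label, just for illustration
o = conv2d_valid(I, W)        # 2x2 feature map from the sketch above
pooled = o.mean()             # global average pooling: (o_11 + o_12 + o_21 + o_22) / 4
L = (pooled - target) ** 2    # a simple squared-error loss
```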

I believe you'd have no problem differentiating the loss function with respect to $W_{ij}$ and forming the gradient matrix

$$\frac{\partial L}{\partial W}=\begin{bmatrix}\frac{\partial L}{\partial W_{11}}&\frac{\partial L}{\partial W_{12}}\\ \frac{\partial L}{\partial W_{21}} & \frac{\partial L}{\partial W_{22}}\end{bmatrix}$$
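To spell out one entry with the chain rule: $W_{11}$ appears in all four outputs, so

$$\frac{\partial L}{\partial W_{11}}=\sum_{i,j}\frac{\partial L}{\partial o_{ij}}\frac{\partial o_{ij}}{\partial W_{11}}=\frac{\partial L}{\partial o_{11}}I_{11}+\frac{\partial L}{\partial o_{12}}I_{12}+\frac{\partial L}{\partial o_{21}}I_{21}+\frac{\partial L}{\partial o_{22}}I_{22}$$

and similarly for the other three entries. In fact, the whole gradient $\frac{\partial L}{\partial W}$ is itself a valid cross-correlation of the input $I$ with the upstream gradient $\frac{\partial L}{\partial o}$.

If you want to convince yourself numerically, the "nudge one filter value and re-run" experiment from the question can be done on the toy example above and compared against this analytic gradient (still a sketch built on the hypothetical helpers defined earlier):

```python
# Analytic gradient for the toy loss: dL/do_ij = 2 * (pooled - target) / 4,
# then dL/dW is the valid cross-correlation of I with dL/do.
dL_do = np.full((2, 2), 2 * (pooled - target) / 4)
dL_dW = conv2d_valid(I, dL_do)

# Finite-difference check: perturb each W_mn slightly and re-run the forward pass.
eps = 1e-6
num_grad = np.zeros_like(W)
for m in range(2):
    for n in range(2):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[m, n] += eps
        W_minus[m, n] -= eps
        L_plus = (conv2d_valid(I, W_plus).mean() - target) ** 2
        L_minus = (conv2d_valid(I, W_minus).mean() - target) ** 2
        num_grad[m, n] = (L_plus - L_minus) / (2 * eps)

print(np.allclose(dL_dW, num_grad, atol=1e-6))   # True: both gradients agree
```

Once $\frac{\partial L}{\partial W}$ is known, the filter is updated exactly like any dense-layer weight, e.g. $W \leftarrow W - \eta\,\frac{\partial L}{\partial W}$.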