I have some intermediate knowledge of image classification using convolutional neural networks. I'm fairly familiar with concepts like 'gradient descent', 'derivatives', 'back-propagation' and the 'weight update process'.

I know that filters are randomly initialized and learned during training. But I really can't digest the concept of **taking the derivative with respect to a filter**.

For example, in a simple dense layer, increasing or decreasing a single weight produces a change in the cost function. **But for filters, how can you decide whether increasing or decreasing a filter value would give a better or worse result?** Because there seems to be no mathematical relationship like the one between the weights, biases, and the loss.

Filters are just transformers of the image. One way to solve this would be to change a single filter value by a small amount and run the whole image dataset through the network to check the cost function. But that gives only the gradient of a single value of a single filter.
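(The perturbation idea described here is essentially a finite-difference estimate of one gradient entry. A minimal numpy sketch, with a made-up toy loss and made-up image/filter values purely for illustration:)

```python
import numpy as np

def loss(W, I):
    """Toy loss: mean of the 2x2 valid-convolution output (stand-in for a real cost)."""
    o = np.array([[np.sum(I[r:r+2, c:c+2] * W) for c in range(2)]
                  for r in range(2)])
    return o.mean()

I = np.arange(9, dtype=float).reshape(3, 3)  # 3x3 "image"
W = np.ones((2, 2))                          # 2x2 filter

# Perturb one filter value by a small eps and re-evaluate the loss:
eps = 1e-6
W_plus = W.copy()
W_plus[0, 0] += eps
grad_w11 = (loss(W_plus, I) - loss(W, I)) / eps  # estimate of dL/dW11
```

Doing this for every value of every filter would indeed be far too expensive, which is exactly why backpropagation computes all these derivatives analytically in one backward pass.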

I'm pretty sure this is not the way to do it.

I have researched this question for more than a month, and everywhere I look it just says 'filters are learned through back-propagation', but no one explains the maths behind it. There are plenty of videos that explain backpropagation for dense layers, but I found none that explains how filters are learned.

## Best Answer

Of course there is a mathematical relationship that binds the filter coefficients to the loss function. The relationship is defined by a slightly modified version of the convolution operator, but it can still be written in simple terms.

Let's say you have a 3x3 image $I$ and a 2x2 filter $W$. Sliding this filter over the image produces a 2x2 output (no padding). The four elements of this output would be

$$ \begin{align}o_{11}&=I_{11}W_{11}+I_{12}W_{12}\\&+I_{21}W_{21}+I_{22}W_{22}\\\\ o_{12}&=I_{12}W_{11}+I_{13}W_{12}\\&+I_{22}W_{21}+I_{23}W_{22}\\\\ o_{21}&=I_{21}W_{11}+I_{22}W_{12}\\&+I_{31}W_{21}+I_{32}W_{22}\\\\ o_{22}&=I_{22}W_{11}+I_{23}W_{12}\\&+I_{32}W_{21}+I_{33}W_{22}\\\\ \end{align} $$

$o_{ij}$ denotes the entry in the $i$-th row and $j$-th column of the output (and similarly for the input and the filter).
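The four formulas above are just a valid cross-correlation. A minimal numpy sketch, with made-up image and filter values:

```python
import numpy as np

I = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
W = np.array([[1., 0.],
              [0., -1.]])

# Slide the 2x2 filter over the 3x3 image ("valid" mode, no padding):
o = np.array([[np.sum(I[r:r+2, c:c+2] * W) for c in range(2)]
              for r in range(2)])
# o[0, 0] == I[0,0]*W[0,0] + I[0,1]*W[0,1] + I[1,0]*W[1,0] + I[1,1]*W[1,1]
```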

Sometimes the output is flattened and fed forward into a dense layer. In this case it becomes a vector, something like $[o_{11} \ o_{12}\ o_{21}\ o_{22}]$, that goes into the further layers.
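For instance (with a hypothetical one-unit dense layer and made-up weight values):

```python
import numpy as np

o = np.array([[-4., -4.],
              [-4., -4.]])           # 2x2 conv output (example values)
v = o.flatten()                      # [o11, o12, o21, o22]

# Hypothetical dense layer: 4 inputs -> 1 output (weights are made up)
Wd = np.array([0.1, 0.2, 0.3, 0.4])
b = 0.5
y = v @ Wd + b                       # scalar that flows on toward the loss
```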

Or the next layer can be a pooling layer, whose output is then fed into a dense layer, after flattening if necessary. For example, if it's average pooling with a 2x2 pool size, we'd have a single output: $$o=\frac{o_{11}+o_{12}+o_{21}+o_{22}}{4},$$ which is sometimes called global pooling. But nonetheless, the output is in a form that reaches the final layer through precisely defined mathematical relationships. I believe you'd have no problem differentiating the loss function with respect to $W_{ij}$ and forming the gradient matrix

$$\frac{\partial L}{\partial W}=\begin{bmatrix}\frac{\partial L}{\partial W_{11}}&\frac{\partial L}{\partial W_{12}}\\ \frac{\partial L}{\partial W_{21}} & \frac{\partial L}{\partial W_{22}}\end{bmatrix}$$
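To make this concrete, here is a sketch of the full chain for the toy setup above, where the loss is taken to be the global-average-pooled output itself (image and filter values are made up). Note that $\partial L/\partial W$ comes out as a cross-correlation of the input with $\partial L/\partial o$, and it matches the finite-difference estimate:

```python
import numpy as np

I = np.arange(1., 10.).reshape(3, 3)     # 3x3 image: 1..9
W = np.array([[0.5, -0.2],
              [0.1,  0.3]])              # 2x2 filter

# Forward: valid cross-correlation, then global average pooling as the toy loss L.
o = np.array([[np.sum(I[r:r+2, c:c+2] * W) for c in range(2)]
              for r in range(2)])
L = o.mean()

# Backward: dL/do is 1/4 everywhere (from the mean); dL/dW is the
# cross-correlation of the input with dL/do.
dL_do = np.full((2, 2), 0.25)
dL_dW = np.array([[np.sum(I[i:i+2, j:j+2] * dL_do) for j in range(2)]
                  for i in range(2)])

# Sanity check against a finite-difference estimate:
eps = 1e-6
num = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        Wp = W.copy()
        Wp[i, j] += eps
        op = np.array([[np.sum(I[r:r+2, c:c+2] * Wp) for c in range(2)]
                       for r in range(2)])
        num[i, j] = (op.mean() - L) / eps
```

So each filter coefficient has an ordinary analytic derivative with respect to the loss, obtained by the chain rule, just as for a dense-layer weight.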