Solved – How to convert fully connected layer into convolutional layer?

conv-neural-network, convolution, deep-learning, machine-learning, neural-networks

I have trouble understanding how the conversion of a fully-connected (FC) layer into a convolutional layer actually works, even after reading http://cs231n.github.io/convolutional-networks/#convert.

In their explanation, it's said that:
[image: the cs231n notes' explanation of the FC-to-CONV conversion]

In this example, as far as I understand, each filter of the converted CONV layer should have the shape (7, 7, 512), i.e. (width, height, feature dimension), and there are 4096 such filters. The spatial size of each filter's output can be calculated as (7 - 7 + 0)/1 + 1 = 1, so the output is a 1×1×4096 vector.
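As a sanity check on that arithmetic, here is a small helper (the name `conv_output_size` is my own, not from the post) implementing the standard output-size formula (W - F + 2P)/S + 1:

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# A 7x7 filter sliding over a 7x7 input, no padding, stride 1:
print(conv_output_size(7, 7, padding=0, stride=1))  # -> 1
```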

Although the converted layer can give us an output of the same size, how can we make sure the two are indeed functionally equivalent? It's mentioned in the later part of the post that we need to reshape the weight matrix of the FC layer into CONV layer filters, but I am still confused about how to actually implement it.

Any explanation or link to other learning resources would be welcome.

Best Answer

Inspired by @dk14's answer, I now have a clearer picture of this question, though I don't completely agree with his answer. I'm posting mine here for further confirmation.

In the vanilla case, where the input to the original AlexNet is still (224, 224, 3), after a series of conv and pooling layers we reach the last conv layer. At this point the activation volume has shape (7, 7, 512).

At the first converted conv layer (converted from FC1), we have 4096 filters of shape (7, 7, 512), which generate a (1, 1, 4096) output volume for us. At the second converted conv layer (converted from FC2), we have 4096 filters of shape (1, 1, 4096), which give us a (1, 1, 4096) output. It is very important to remember that, in the conversion, the filter size must match the input volume size; that is why we end up with 1×1 filters here. Similarly, the last converted conv layer has 1000 filters of shape (1, 1, 4096) and gives us the scores for the 1000 classes.
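The layer-by-layer shapes above can be sketched in NumPy. The dimensions below are scaled-down stand-ins for (7, 7, 512), 4096, 4096, and 1000 so the sketch runs quickly; the structure is identical at full size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins: C=8 for 512, FC1=FC2=16 for 4096, CLASSES=10 for 1000.
H, W, C, FC1, FC2, CLASSES = 7, 7, 8, 16, 16, 10

x = rng.standard_normal((H, W, C))        # last conv activation volume

# FC1 -> CONV: FC1 filters, each the size of the whole input volume.
w1 = rng.standard_normal((FC1, H, W, C))
h1 = np.einsum('khwc,hwc->k', w1, x)      # a (1, 1, FC1) output volume

# FC2 -> CONV: FC2 filters of shape (1, 1, FC1) -- plain 1x1 convolutions.
w2 = rng.standard_normal((FC2, FC1))
h2 = w2 @ h1                              # a (1, 1, FC2) output volume

# FC3 -> CONV: CLASSES filters of shape (1, 1, FC2).
w3 = rng.standard_normal((CLASSES, FC2))
scores = w3 @ h2                          # class scores
print(scores.shape)  # -> (10,)
```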

The process is summarized in the post: http://cs231n.github.io/convolutional-networks/#convert.

For FC1, the original weight matrix has shape (7*7*512, 4096), meaning each of the 4096 neurons in FC1 is connected with every activation in the (7, 7, 512) input volume. After conversion, the weights become a (7, 7, 512, 4096) tensor, i.e. 4096 filters of shape (7, 7, 512). It's like taking each column of the original gigantic matrix (one column per output neuron) and reshaping it into a filter.
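A minimal NumPy sketch of this reshaping (again with scaled-down dimensions, since the real (7*7*512, 4096) matrix would allocate hundreds of megabytes) shows that the FC forward pass and the converted conv forward pass produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins for (7, 7, 512) and 4096.
H, W, C, N = 7, 7, 8, 16

x = rng.standard_normal((H, W, C))                 # last conv activation volume
fc_weights = rng.standard_normal((H * W * C, N))   # FC weight matrix

# FC forward pass: flatten the volume, then multiply.
fc_out = x.reshape(-1) @ fc_weights                # shape (N,)

# CONV forward pass: each column of the FC matrix becomes one (H, W, C) filter;
# with a filter the size of the input, each "convolution" is a single dot product.
filters = fc_weights.T.reshape(N, H, W, C)
conv_out = np.array([np.sum(f * x) for f in filters])  # shape (N,)

print(np.allclose(fc_out, conv_out))  # -> True
```

The key detail is that the reshape must use the same memory ordering as the flatten: here both `x.reshape(-1)` and `fc_weights.T.reshape(...)` use NumPy's default row-major order, so corresponding weights and activations line up.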