Solved – Why do CNNs conclude with FC layers

conv-neural-network, neural-networks, random-forest, svm

From my understanding, CNNs consist of two parts: the first part (conv/pool layers), which does the feature extraction, and the second part (FC layers), which does the classification based on those features.

Since fully-connected neural nets aren't the best classifiers (i.e. they tend to be outperformed by SVMs and RFs), why do CNNs conclude with FC layers, rather than, say, an SVM or an RF?

Best Answer

It's not so simple. First of all, an SVM is, in a way, a type of neural network (you can learn an SVM solution through backpropagation). See What *is* an Artificial Neural Network?. Second, you can't know beforehand which model will work better, but the point is that with a fully neural architecture you can learn the weights end-to-end, whereas attaching an SVM or RF to the last hidden-layer activations of a CNN is simply an ad hoc procedure. It may perform better, or it may not; we can't know without testing.
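To make the "ad hoc procedure" concrete, here is a minimal sketch of freezing a CNN and training an SVM on its last hidden-layer activations. The choice of PyTorch, torchvision's ResNet-18, scikit-learn's LinearSVC, and the dummy data are assumptions for illustration only; the answer does not prescribe any particular libraries.

```python
# Sketch: attaching an SVM to the last hidden-layer activations of a CNN.
# ResNet-18 and the dummy dataset below are placeholder assumptions.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import LinearSVC

# Pretrained CNN used purely as a frozen feature extractor: drop its final FC layer.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # outputs (N, 512, 1, 1)
feature_extractor.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """Map a batch of images (N, 3, 224, 224) to 512-d feature vectors."""
    return feature_extractor(images).flatten(start_dim=1)

# Dummy data standing in for a real dataset.
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 2, (32,))

features = extract_features(images).numpy()

# The ad hoc step: train a separate SVM on the frozen CNN features.
# No gradient can flow from the SVM back into the convolutional layers,
# so the learned representation is never tuned for this classifier.
svm = LinearSVC()
svm.fit(features, labels.numpy())
print(svm.score(features, labels.numpy()))
```

Whether this beats an end-to-end FC head is exactly the empirical question: nothing in this pipeline lets the SVM's objective adjust the convolutional filters.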

The important part is that a fully convolutional architecture is capable of representation learning, which is useful for a myriad of reasons. For one, it may reduce or even eliminate feature engineering in your problem altogether.

As for the FC layers, they are mathematically equivalent to 1x1 convolutional layers. See Yann LeCun's post, which I transcribe below:

In Convolutional Nets, there is no such thing as "fully-connected layers". There are only convolution layers with 1x1 convolution kernels and a full connection table.

It's a too-rarely-understood fact that ConvNets don't need to have a fixed-size input. You can train them on inputs that happen to produce a single output vector (with no spatial extent), and then apply them to larger images. Instead of a single output vector, you then get a spatial map of output vectors. Each vector sees input windows at different locations on the input.

In that scenario, the "fully connected layers" really act as 1x1 convolutions.
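A small numerical sketch of this equivalence (PyTorch and the specific layer sizes are my assumptions, not part of LeCun's post): copying an FC layer's weights into a 1x1 convolution reproduces its output exactly, and the same convolution then accepts a larger feature map, yielding a spatial map of output vectors instead of a single one.

```python
# Sketch: an FC layer behaves like a 1x1 convolution, and the resulting
# layer can be applied to larger inputs to produce a spatial map of outputs.
import torch
import torch.nn as nn

fc = nn.Linear(in_features=256, out_features=10)

# Build a 1x1 convolution that reuses the FC weights.
conv1x1 = nn.Conv2d(in_channels=256, out_channels=10, kernel_size=1)
conv1x1.weight.data = fc.weight.data.view(10, 256, 1, 1).clone()
conv1x1.bias.data = fc.bias.data.clone()

# On a feature "vector" (spatial extent 1x1) both layers give the same answer.
x = torch.randn(1, 256)
out_fc = fc(x)                            # shape (1, 10)
out_conv = conv1x1(x.view(1, 256, 1, 1))  # shape (1, 10, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(start_dim=1), atol=1e-6))  # True

# The same 1x1 convolution also accepts a larger feature map, producing one
# 10-dimensional output per spatial location, i.e. a spatial map of outputs.
feature_map = torch.randn(1, 256, 7, 7)
print(conv1x1(feature_map).shape)  # torch.Size([1, 10, 7, 7])
```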
