Solved – Is a convolutional neural network (CNN) a special case of a multilayer perceptron (MLP)? And why not use an MLP for everything?

conv-neural-network, machine-learning, neural-networks

If convolution can be expressed as matrix multiplication (example), can we say that a convolutional neural network (CNN) is a special case of a multilayer perceptron (MLP)?

If yes, why don't people use a big enough MLP for everything and let the computer learn to use convolution by itself?

Best Answer

A convolution can be expressed as a matrix multiplication, but the matrix is multiplied with a patch around every position in the image separately. So you go to position (1,1), extract a patch, and multiply it with the weight matrix; then you do the same thing at position (1,2), and so forth. Because the same weights are reused at every position, there are obviously fewer degrees of freedom than when applying an MLP directly. Most people regard an MLP as a special case of a convolution where the spatial dimensions are 1x1.
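To make this concrete, here is a minimal numpy sketch of convolution as a single matrix multiplication over flattened patches (the "im2col" trick). The single-channel image, stride 1, and lack of padding are illustrative assumptions of mine, not from the original answer:

```python
import numpy as np

H, W, K = 5, 5, 3                       # image size and kernel size (illustrative)
rng = np.random.default_rng(0)
image = rng.standard_normal((H, W))
kernel = rng.standard_normal((K, K))

# Extract every KxK patch and flatten it into one row of a matrix.
patches = np.array([
    image[i:i + K, j:j + K].ravel()
    for i in range(H - K + 1)
    for j in range(W - K + 1)
])                                       # shape: ((H-K+1)*(W-K+1), K*K)

# One matrix-vector product applies the same kernel to every patch.
out = (patches @ kernel.ravel()).reshape(H - K + 1, W - K + 1)

# Reference: direct sliding-window "convolution" (cross-correlation,
# i.e. no kernel flip, as is conventional in deep learning).
ref = np.array([[(image[i:i + K, j:j + K] * kernel).sum()
                 for j in range(W - K + 1)]
                for i in range(H - K + 1)])
assert np.allclose(out, ref)
```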

Edit Start

Regarding the MLP as a special case of the CNN: some commenters do not share this opinion. Yann LeCun, who can be counted as one of the inventors of CNNs, made a similar point on Facebook: https://www.facebook.com/yann.lecun/posts/10152820758292143

He said that in CNNs there is no such thing as a "fully connected" layer; there are only layers with 1x1 spatial extent and kernels with 1x1 spatial extent. If one can "convert" FC layers, which are the building blocks of MLPs, into convolutional layers, then one can obviously also convert an entire MLP into a CNN by interpreting the input as a vector with only a channel dimension.

An example: if I have an image of size $H\times W\times C$ ($C$ channels) and I apply a single layer of an MLP to it, I first flatten the input into a vector $x$ of size $V=HWC$. I then apply a weight matrix $M\in \mathbb{R}^{U\times V}$ to it, thereby creating $U$ hidden activations. I could equally interpret the input vector $x$ as an image with only one pixel but $V$ "channels", $x\in\mathbb{R}^{1\times 1\times V}$, and the weight matrix as a kernel with only a one-pixel area but $U$ filters taking in $V$ channels each, $M\in\mathbb{R}^{U\times 1\times 1\times V}$. I can then call some Conv2D function that carries out this operation and computes exactly the same result as the MLP.
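As a sanity check, here is a minimal PyTorch sketch of that equivalence; the sizes H, W, C, U below are illustrative assumptions. It copies the weights of a fully connected layer into a 1x1 convolution and compares the outputs:

```python
import torch

H, W, C, U = 4, 4, 3, 8                 # illustrative sizes
V = H * W * C

x = torch.randn(1, V)                   # flattened image as a vector
fc = torch.nn.Linear(V, U, bias=False)  # one MLP layer: M in R^{U x V}

# Reinterpret the same weights as U 1x1 kernels over V "channels".
conv = torch.nn.Conv2d(V, U, kernel_size=1, bias=False)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(U, V, 1, 1))

y_fc = fc(x)                            # shape (1, U)
y_conv = conv(x.view(1, V, 1, 1))       # shape (1, U, 1, 1)
assert torch.allclose(y_fc, y_conv.view(1, U), atol=1e-6)
```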

Edit End

If yes, why don't people use a big enough MLP for everything and let the computer learn to use convolution by itself?

That is a nice idea (and probably worth researching), but it's simply not practical:

  1. The MLP has far too many degrees of freedom, so it is likely to overfit (see the parameter-count sketch after this list).
  2. In addition to learning the weights, the network would have to learn their dependency structure, i.e. the weight sharing that a CNN builds in by design.
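To put rough numbers on point 1, a back-of-the-envelope comparison; the layer sizes are illustrative assumptions, not from the post:

```python
# Parameter counts for one layer on a 224x224 RGB image (no biases counted).
H, W, C = 224, 224, 3          # image height, width, channels
K, F = 3, 64                   # conv kernel size and number of filters

conv_params = F * K * K * C    # 64 filters of 3x3x3
V = H * W * C                  # flattened input size
mlp_params = V * V             # fully connected layer mapping V -> V

print(conv_params)             # 1728
print(mlp_params)              # 22658678784, about 2.3e10
```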

Since most deep learning research is closely tied to NLP, speech processing, and computer vision, people are eager to solve their problems and perhaps less eager to investigate how a function space more general than that of a CNN could learn to constrain itself to that particular function space. Though imho it's certainly interesting to think about.