Solved – In CNN, are upsampling and transpose convolution the same?

conv-neural-network, machine-learning, neural-networks, transposed-convolution

Both the terms "upsampling" and "transpose convolution" are used when you are doing "deconvolution" (<– not a good term, but let me use it here). Originally, I thought they meant the same thing, but after reading these articles it seems to me that they are different. Can anyone please clarify?

  1. Transpose convolution: it looks like we can use it when we propagate the loss backward through a convolutional neural network (see the sketch after this list).

    http://andrew.gibiansky.com/blog/machine-learning/convolutional-neural-networks/#Backward-Propagation

    https://github.com/vdumoulin/conv_arithmetic

    https://arxiv.org/pdf/1312.6034v2.pdf, section 4 "For the convolutional layer…"

  2. Upsampling: it seems like we use it when we want to upsample from a smaller input to a larger output in a convnet-deconvnet structure.

    https://www.youtube.com/watch?v=ByjaPdWXKJ4&feature=youtu.be&t=22m
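Regarding point 1: the gradient of a convolution with respect to its input can be computed as a transposed convolution of the upstream gradient with the same kernel. A minimal PyTorch sketch of this check (the framework choice and toy shapes are assumptions of this example, not from the linked articles):

```python
import torch
import torch.nn.functional as F

# Toy shapes for illustration (not from the linked articles).
x = torch.randn(1, 1, 5, 5, requires_grad=True)  # input image
w = torch.randn(1, 1, 3, 3)                      # convolution kernel

y = F.conv2d(x, w)            # forward convolution: 5x5 -> 3x3
grad_y = torch.randn_like(y)
y.backward(grad_y)            # backpropagate an upstream gradient

# The gradient w.r.t. the input is a transposed convolution of the
# upstream gradient with the same kernel.
via_transpose = F.conv_transpose2d(grad_y, w)
print(torch.allclose(x.grad, via_transpose, atol=1e-5))  # True
```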

Best Answer

Since there is no detailed and accepted answer yet, I'll try my best.

Let's first understand where the motivation for such layers comes from: e.g. a convolutional autoencoder. You can use a convolutional autoencoder to extract features of images while training the autoencoder to reconstruct the original image. (It is an unsupervised method.)

Such an autoencoder has two parts: the encoder, which extracts the features from the image, and the decoder, which reconstructs the original image from these features. The architectures of the encoder and decoder are usually mirrored.

In a convolutional autoencoder, the encoder works with convolution and pooling layers. I assume that you know how these work. The decoder tries to mirror the encoder, but instead of "making everything smaller" it has the goal of "making everything bigger" to match the original size of the image.
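As a rough illustration of that mirrored structure, here is a minimal convolutional autoencoder sketch in PyTorch. The layer sizes and the 28x28 input are assumptions for this example; the decoder uses the transposed convolution layer discussed next:

```python
import torch.nn as nn

# Encoder: convolution + pooling make everything smaller.
encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 28x28 -> 28x28, 8 channels
    nn.ReLU(),
    nn.MaxPool2d(2),                            # 28x28 -> 14x14
)

# Decoder: mirrors the encoder, making everything bigger again.
decoder = nn.Sequential(
    nn.ConvTranspose2d(8, 1, kernel_size=2, stride=2),  # 14x14 -> 28x28
    nn.Sigmoid(),                               # pixel values back in [0, 1]
)
```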

The opposite of the convolutional layers are the transposed convolution layers (also known as deconvolution, although mathematically speaking deconvolution is something different). They work with filters, kernels, and strides just like the convolution layers, but instead of mapping e.g. 3x3 input pixels to 1 output pixel, they map 1 input pixel to 3x3 output pixels. Of course, backpropagation also works a little differently.
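A quick way to see the "1 input pixel to 3x3 output pixels" behaviour, sketched in PyTorch (the 4x4 input and single channel are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Transposed convolution: each input pixel stamps a copy of the 3x3 kernel.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, bias=False)

x = torch.zeros(1, 1, 4, 4)
x[0, 0, 1, 2] = 1.0          # a single active input pixel

y = tconv(x)
print(y.shape)               # torch.Size([1, 1, 6, 6]): the map grows
# y is the 3x3 kernel stamped at the position of the active pixel;
# where stamps from several input pixels overlap, their values are summed.
```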

The opposite of the pooling layers are the upsampling layers, which in their purest form only resize the image (or copy each pixel as many times as needed). A more advanced technique is unpooling, which reverses max pooling by remembering the locations of the maxima in the max pooling layers and, in the unpooling layers, copying each value back to exactly that location. To quote from this paper (https://arxiv.org/pdf/1311.2901v3.pdf):

In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus.
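Both operations are available as standard layers; here is a small PyTorch sketch contrasting plain upsampling with switch-based unpooling (the 4x4 input is an illustrative assumption):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, return_indices=True)     # records the "switches"
unpool = nn.MaxUnpool2d(2)
upsample = nn.Upsample(scale_factor=2, mode="nearest")

x = torch.arange(16.0).reshape(1, 1, 4, 4)

pooled, switches = pool(x)       # 4x4 -> 2x2, maxima locations remembered
print(unpool(pooled, switches))  # maxima placed back at their original
                                 # locations, zeros everywhere else
print(upsample(pooled))          # each value simply copied into a 2x2 block
```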

For more technical detail and context, have a look at this really good, demonstrative, and in-depth explanation: http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html

And have a look at https://www.quora.com/What-is-the-difference-between-Deconvolution-Upsampling-Unpooling-and-Convolutional-Sparse-Coding
