Solved – In CNN, are upsampling and transpose convolution the same?

conv-neural-network, machine-learning, neural-networks, transposed-convolution

Both the terms "upsampling" and "transpose convolution" are used when you are doing "deconvolution" (<– not a good term, but let me use it here). Originally, I thought they meant the same thing, but after reading these articles it seems to me that they are different. Can anyone please clarify?

  1. Transpose convolution: it looks like we can use it when we propagate the loss backward through a convolutional neural network (see the sketch after this list).

    http://andrew.gibiansky.com/blog/machine-learning/convolutional-neural-networks/#Backward-Propagation

    https://github.com/vdumoulin/conv_arithmetic

    https://arxiv.org/pdf/1312.6034v2.pdf, section 4 "For the convolutional layer…"

  2. Upsampling: it seems like we use it when we want to upsample from a smaller input to a larger output in a convnet-deconvnet structure.

    https://www.youtube.com/watch?v=ByjaPdWXKJ4&feature=youtu.be&t=22m
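Regarding point 1: the gradient of a convolution with respect to its input can be computed as a transposed convolution of the upstream gradient with the same kernel. A minimal PyTorch sketch of this check (the framework choice and toy shapes are assumptions of this example, not from the linked articles):

```python
import torch
import torch.nn.functional as F

# Toy shapes for illustration (not from the linked articles).
x = torch.randn(1, 1, 5, 5, requires_grad=True)  # input image
w = torch.randn(1, 1, 3, 3)                      # convolution kernel

y = F.conv2d(x, w)            # forward convolution: 5x5 -> 3x3
grad_y = torch.randn_like(y)
y.backward(grad_y)            # backpropagate an upstream gradient

# The gradient w.r.t. the input is a transposed convolution of the
# upstream gradient with the same kernel.
via_transpose = F.conv_transpose2d(grad_y, w)
print(torch.allclose(x.grad, via_transpose, atol=1e-5))  # True
```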

Best Answer

Since there is no detailed and accepted answer yet, I'll try my best.

Let's first understand where the motivation for such layers comes from: e.g. a convolutional autoencoder. You can use a convolutional autoencoder to extract features of images while training the autoencoder to reconstruct the original image. (It is an unsupervised method.)

Such an autoencoder has two parts: the encoder, which extracts the features from the image, and the decoder, which reconstructs the original image from these features. The architectures of the encoder and decoder are usually mirrored.

In a convolutional autoencoder, the encoder works with convolution and pooling layers. I assume that you know how these work. The decoder tries to mirror the encoder, but instead of "making everything smaller" it has the goal of "making everything bigger" to match the original size of the image.
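As a rough illustration of that mirrored structure, here is a minimal convolutional autoencoder sketch in PyTorch. The layer sizes and the 28x28 input are assumptions for this example; the decoder uses the transposed convolution layer discussed next:

```python
import torch.nn as nn

# Encoder: convolution + pooling make everything smaller.
encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 28x28 -> 28x28, 8 channels
    nn.ReLU(),
    nn.MaxPool2d(2),                            # 28x28 -> 14x14
)

# Decoder: mirrors the encoder, making everything bigger again.
decoder = nn.Sequential(
    nn.ConvTranspose2d(8, 1, kernel_size=2, stride=2),  # 14x14 -> 28x28
    nn.Sigmoid(),                               # pixel values back in [0, 1]
)
```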

The opposite of the convolutional layers are the transposed convolution layers (also known as deconvolution, although mathematically speaking deconvolution is something different). They work with filters, kernels, and strides just like the convolution layers, but instead of mapping e.g. 3x3 input pixels to 1 output pixel, they map 1 input pixel to 3x3 output pixels. Of course, backpropagation also works a little differently.
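A quick way to see the "1 input pixel to 3x3 output pixels" behaviour, sketched in PyTorch (the 4x4 input and single channel are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Transposed convolution: each input pixel stamps a copy of the 3x3 kernel.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, bias=False)

x = torch.zeros(1, 1, 4, 4)
x[0, 0, 1, 2] = 1.0          # a single active input pixel

y = tconv(x)
print(y.shape)               # torch.Size([1, 1, 6, 6]): the map grows
# y is the 3x3 kernel stamped at the position of the active pixel;
# where stamps from several input pixels overlap, their values are summed.
```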

The opposite of the pooling layers are the upsampling layers, which in their purest form only resize the image (or copy each pixel as many times as needed). A more advanced technique is unpooling, which reverses max pooling by remembering the locations of the maxima in the max pooling layers and, in the unpooling layers, copying each value back to exactly that location. To quote from this paper (https://arxiv.org/pdf/1311.2901v3.pdf):

In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus.
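Both operations are available as standard layers; here is a small PyTorch sketch contrasting plain upsampling with switch-based unpooling (the 4x4 input is an illustrative assumption):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, return_indices=True)     # records the "switches"
unpool = nn.MaxUnpool2d(2)
upsample = nn.Upsample(scale_factor=2, mode="nearest")

x = torch.arange(16.0).reshape(1, 1, 4, 4)

pooled, switches = pool(x)       # 4x4 -> 2x2, maxima locations remembered
print(unpool(pooled, switches))  # maxima placed back at their original
                                 # locations, zeros everywhere else
print(upsample(pooled))          # each value simply copied into a 2x2 block
```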

For more technical detail and context, have a look at this really good, demonstrative, and in-depth explanation: http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html

And have a look at https://www.quora.com/What-is-the-difference-between-Deconvolution-Upsampling-Unpooling-and-Convolutional-Sparse-Coding
