Neural Networks – Understanding the Shape of the Weights in Convolutional Neural Networks

conv-neural-network, neural-networks

I've read that CNNs have a neuron per pixel, but I've also read that this isn't true, so what is the actual answer? As far as I know, a CNN tries to adjust a weight matrix, which is also the kernel matrix (I might be wrong about this, so don't judge me). If that's so, how can we have a neuron per pixel? If we had a neuron per pixel, shouldn't the weight matrix have the same dimensions as the image?

Can anybody explain the inner workings of a CNN, with the dimensions and shapes of the tensors involved?

Best Answer

A CNN (strictly, a convolutional layer in a neural network) often has a neuron for each pixel. However, it doesn't have an independently estimated set of weights for each neuron: the weights are constrained to be the same across all the neurons in a layer, and lots of them are constrained to be zero.

That is, the output for a pixel $j$ is still $\sigma(b_j+\sum_i w_{ij} x_i)$, where $x_i$ is the input for pixel $i$, $w_{ij}$ is the weight connecting input pixel $i$ to output pixel $j$, and $b_j$ is the bias, but $w_{ij}$ is defined in terms of the relative positions of $i$ and $j$. If pixels $i$ and $j$ are close, $w_{ij}$ gets estimated; if they are not close, $w_{ij}$ is just set to zero. 'Close' in this context might mean 'adjacent' or it might mean in the same small patch; the 'AlexNet' CNN that made CNNs famous used $11\times 11$ patches in its first layer.
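To make the locality concrete, here is a minimal sketch for a 1D "image" with a hypothetical 3-tap kernel: the pre-activation for an output pixel $j$ is just the bias plus a dot product over the small window of 'close' inputs, and every pixel outside that window effectively has $w_{ij}=0$.

```python
import numpy as np

# Hypothetical 1D example (all values made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input "pixels"
kernel = np.array([0.25, 0.5, 0.25])       # the only weights that get estimated
b = 0.1                                    # bias

j = 2                                      # pick an interior output pixel
window = x[j - 1 : j + 2]                  # the 'close' inputs for pixel j
pre_activation = b + np.dot(kernel, window)

# Pixels outside the window (e.g. x[0] and x[4]) have w_ij = 0,
# so they contribute nothing to output pixel j.
print(pre_activation)  # 0.1 + 0.25*2 + 0.5*3 + 0.25*4 = 3.1
```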

On top of this, the weights $w_{ij}$ that do get estimated, the ones for 'close' pairs of pixels, are constrained to depend only on the relative position of $i$ and $j$. That is, $w_{ii}$ will be the same for all $i$, and $w_{i,\text{the point just left of $i$}}$ will be the same for all $i$, and $w_{i,\text{the point two left and one up from $i$}}$ will be the same for all $i$. This constraint is what's usually written in terms of a convolutional filter, but you can think of it as just a constraint on estimating the parameters.
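You can see both constraints at once by writing out the full dense weight matrix that the convolution implies (a sketch, reusing the hypothetical 1D kernel from above): every row of $W$ is the same kernel, shifted over by one pixel, and everything else is zero. Multiplying by that matrix gives exactly the sliding dot product a convolutional layer computes.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.25, 0.5, 0.25])
n_out = len(x) - len(kernel) + 1           # 3 'valid' output pixels

# Build the (n_out x n_in) weight matrix the constraints imply:
# each row is the same kernel, shifted one position to the right.
W = np.zeros((n_out, len(x)))
for j in range(n_out):
    W[j, j : j + len(kernel)] = kernel

dense_out = W @ x

# np.convolve flips its kernel, so flip it first to get the sliding
# dot product (cross-correlation) that matches the rows of W.
conv_out = np.convolve(x, kernel[::-1], mode="valid")
print(np.allclose(dense_out, conv_out))    # True
```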

As a result, while you have a neuron per pixel, you only have a handful of weights for the whole layer.
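The saving is dramatic even for small images. As a back-of-the-envelope sketch (the sizes here are hypothetical), compare a single $3\times 3$ filter on a $28\times 28$ image with an unconstrained layer that also has one neuron per pixel:

```python
# Hypothetical sizes: a 28x28 grayscale image, one 3x3 convolutional filter.
h, w = 28, 28
k = 3

conv_params = k * k + 1                     # 9 shared weights + 1 shared bias
dense_params = (h * w) * (h * w) + (h * w)  # unconstrained: a weight per pixel pair

print(conv_params)   # 10
print(dense_params)  # 615440
```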

And finally, you don't always have a neuron per pixel; sometimes you have one for every few pixels in a spaced-out grid, i.e., the filter is applied with a stride greater than 1.
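A strided convolution can be sketched as computing the full sliding dot product and then keeping only every few outputs (toy 1D values, chosen for illustration):

```python
import numpy as np

x = np.arange(8, dtype=float)              # 8 input pixels: 0, 1, ..., 7
kernel = np.array([1.0, 1.0])              # toy 2-tap filter
stride = 2                                 # one neuron for every 2 pixels

# Full 'valid' sliding dot product, then keep every stride-th output.
full = np.array([np.dot(kernel, x[i : i + len(kernel)])
                 for i in range(len(x) - len(kernel) + 1)])
strided = full[::stride]

print(len(full), len(strided))             # 7 4
```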
