Solved – Can Neural Network take multidimensional inputs

conv-neural-networkmachine learningnatural languageneural networksrecurrent neural network

I recently learned RNN and find that a common feature of it and CNN is that they use either a LSTM cell or Convolution to process a single multidimensional input (like images and word embeddings which size might be m*n) to a single number(assuming batch size = 1) and then use softmax or feed it into a neural network to get the final result. So I search for some cases of how a pure Neural Network can process multidimensional data like a image or a sentence. A classical way for image processing in a neural network is first flatten a 2D inputs to a vector (if an image is 64*64 then the size of vector is 4096) and this vector is going to be feed into a neural network which means at this time a single input becomes a number instead of a 2D matrix.

So my question is what if we don't flatten it at the beginning and just directly feed into a Neural Network (Fully Connected)? Is it possible? Is the operation between weights and inputs in Neural Network going to become element-wise like what it is in RNN?

The reason I ask this question is that from my point of view the position of each element in a 2D matrix might indicates some special information that you sometimes don't want to ignore, for example, no one flatten a word embedding matrix and directly feed it into a Neural Network for text classification tasks.

Thank you!

Best Answer

Yes, preserving this spatial structure is exactly what makes convolutional neural networks so good at their job.

Is the operation between weights and inputs in Neural Network going to become element-wise like what it is in RNN?

I'm not clear on exactly what you mean by element-wise in this context, but the weights used in a CNN are shared across spatial locations of the input, inducing some sort of translation-invariance/equivariance prior.

Relation to Word2Vec

==========================================

Word2Vec in a simple picture:

word2vec pic

More in-depth explanation:

I believe it's related to the recent Word2Vec innovation in natural language processing. Roughly, Word2Vec means our vocabulary is discrete and we will learn an map which will embed each word into a continuous vector space. Using this vector space representation will allow us to have a continuous, distributed representation of our vocabulary words. If for example our dataset consists of n-grams, we may now use our continuous word features to create a distributed representation of our n-grams. In the process of training a language model we will learn this word embedding map. The hope is that by using a continuous representation, our embedding will map similar words to similar regions. For example in the landmark paper Distributed Representations of Words and Phrases and their Compositionality, observe in Tables 6 and 7 that certain phrases have very good nearest neighbour phrases from a semantic point of view. Transforming into this continuous space allows us to use continuous metric notions of similarity to evaluate the semantic quality of our embedding.

Explanation using Lasagne code

Let's break down the Lasagne code snippet:

x = T.imatrix()

x is a matrix of integers. Okay, no problem. Each word in the vocabulary can be represented an integer, or a 1-hot sparse encoding. So if x is 2x2, we have two datapoints, each being a 2-gram.

l_in = InputLayer((3, ))

The input layer. The 3 represents the size of our vocabulary. So we have words $w_0, w_1, w_2$ for example.

W = np.arange(3*5).reshape((3, 5)).astype('float32')

This is our word embedding matrix. It is a 3 row by 5 column matrix with entries 0 to 14.

Up until now we have the following interpretation. Our vocabulary has 3 words and we will embed our words into a 5 dimensional vector space. For example, we may represent one word $w_0 = (1,0,0)$, and another word $w_1 = (0, 1, 0)$ and the other word $w_2 = (0, 0, 1)$, e.g. as hot sparse encodings. We can view the $W$ matrix as embedding these words via matrix multiplication. Therefore the first word $w_0 \rightarrow w_0W = [0, 1, 2, 3, 4].$ Simmilarly $w_1 \rightarrow w_1W = [5, 6, 7, 8, 9]$.

It should be noted, due to the one-hot sparse encoding we are using, you also see this referred to as table lookups.

l1 = EmbeddingLayer(l_in, input_size=3, output_size=5, W=W)

The embedding layer

 output = get_output(l1, x)

Symbolic Theano expression for the embedding.

f = theano.function([x], output)

Theano function which computes the embedding.

x_test = np.array([[0, 2], [1, 2]]).astype('int32')

It's worth pausing here to discuss what exactly x_test means. First notice that all of x_test entries are in {0, 1, 2}, i.e. range(3). x_test has 2 datapoints. The first datapoint [0, 2] represents the 2-gram $(w_0, w_2)$ and the second datapoint represents the 2-gram $(w_1, w_2)$.

We wish to embed our 2-grams using our word embedding layer now. Before we do that, let's make sure we're clear about what should be returned by our embedding function f. The 2 gram $(w_0, w_2)$ is equivalent to a [[1, 0, 0], [0, 0, 1]] matrix. Applying our embedding matrix W to this sparse matrix should yield: [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]. Note in order to have the matrix multiplication work out, we have to apply the word embedding matrix $W$ via right multiplication to the sparse matrix representation of our 2-gram.

f(x_test)

returns:

          array([[[  0.,   1.,   2.,   3.,   4.],
                  [ 10.,  11.,  12.,  13.,  14.]],
                 [[  5.,   6.,   7.,   8.,   9.],
                  [ 10.,  11.,  12.,  13.,  14.]]], dtype=float32)

To convince you that the 3 does indeed represent the vocabulary size, try inputting a matrix x_test = [[5, 0], [1, 2]]. You will see that it raises a matrix mis-match error.

Solved – How tomplement a Neural Network for Multidimensional Input Classification

Well, the whole idea of machine learning is that you let the algorithm decide which inputs are important and which aren't. At times, we might want to restrict the algorithm to focus on combining inputs in certain ways. An example for images is convolutional neural nets. Here we use the fact that patterns in images are based on the values of near-by pixels.
What you could do for example is first feed the inputs in one dimension into sub-nets and then combine the outputs of those nets into a final net. For example first feed all the means into one net and the variances into another and then combine the outputs of those nets into a final net.
However you could also feed all properties of each pixel into subnets and then feed the outputs of each pixel into a bigger net. In this case, it isn't obvious what would give better results and why.
So my advice would be that if you have obvious structure like in images, you might make use of it like in convolutional neural nets. However if it isn't obvious, just let the network figure it out. That is what it's for.

Best Answer

Related Solutions

Neural Networks – What is an Embedding Layer in a Neural Network?

Relation to Word2Vec

Word2Vec in a simple picture:

More in-depth explanation:

Explanation using Lasagne code

Solved – How tomplement a Neural Network for **Multidimensional** Input Classification

Related Question

Solved – How tomplement a Neural Network for Multidimensional Input Classification