Solved – the “expressive power” of the composition function in a Recursive Neural Tensor Network

deep learning, linear algebra, machine learning, neural networks, recurrent neural network

I wasn't entirely sure if this question belongs here or on math.stackexchange, since it's only partially about machine learning and perhaps more of a conceptual question about linear algebra (or tensor algebra). Please let me know if you believe this question is better asked on another SE site.


The type of network I am reading about is described for example here:

Socher et al: Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

As far as I currently understand it, the basic network architecture is very similar to a recursive neural network, i.e. a recurrent neural network in which the unfolded network for a given finite input forms a (usually binary) tree instead of a "flat" chain (as in the recurrent case).

The difference then rests in the composition function at each of the tree's nodes. The parent vector $p$ of two children vectors (either input words, or parent vectors themselves), is calculated as follows:

$p = f(\begin{bmatrix} b \\ c \end{bmatrix}^T V^{[1:d]} \begin{bmatrix} b \\ c \end{bmatrix} + W\begin{bmatrix} b \\ c \end{bmatrix})$

Here, $f$ is a nonlinearity like tanh, $b$ and $c$ are $d$-dimensional children vectors, $V^{[1:d]}$ is the tensor made up of $d$ 'slices' $V^{[i]} \in \mathbb{R}^{2d \times 2d}$ (the authors' terminology), each slice contributing one entry of $p$, and $W\begin{bmatrix} b \\ c \end{bmatrix}$ is the usual standard-layer multiplication (as in a non-tensor network) added at the end.
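
To check that I am reading the formula correctly, here is a minimal NumPy sketch of the composition as I understand it (the dimensions, variable names, and random parameters are my own, purely for illustration):

```python
import numpy as np

d = 4                                 # dimensionality of each word/phrase vector
b = np.random.randn(d)                # left child
c = np.random.randn(d)                # right child
bc = np.concatenate([b, c])           # [b; c], shape (2d,)

V = np.random.randn(d, 2 * d, 2 * d)  # tensor: d slices, each 2d x 2d
W = np.random.randn(d, 2 * d)         # standard composition matrix

# tensor term: one scalar per slice, [b; c]^T V^{[i]} [b; c]
bilinear = np.array([bc @ V[i] @ bc for i in range(d)])

# standard term: W [b; c]
standard = W @ bc

p = np.tanh(bilinear + standard)      # parent vector, shape (d,)
```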

Please correct me if I made any mistakes so far.

My question is about what exactly changes, in terms of "expressive power" and the network's ability to learn, when we go from a single weight matrix to "slices of a tensor" as above.

I'll try to break down my uncertainty about this approach into smaller partial questions:

  1. In principle, the units (of a standard multi-layer ANN just as well as those of a "tree-shaped" RNN) are already able to capture any function we are interested in. Given a sufficient number of hidden units, a network can approximate a non-linear function as a linear combination of the non-linearity (like tanh), provided the non-linearity satisfies the conditions of the universal approximation theorem. Under this view, I am wondering what the additional "power" of the tensor-based units is?

  2. One difference that I can see is that the $d$ "slices" of the tensor are essentially a way to increase the number of parameters per unit (and in the network in total), while leaving the input dimensionality unchanged. Instead of a single $[d \, \times \, 2d]$ matrix, our units now include $d$ copies of a $[2d \, \times \, 2d]$ matrix (the slices of $V$ above). To me this looks like the network simply has more parameters for learning from the same input – but, as a trade-off, this could also increase the risk of overfitting. Correct?

  3. In a lecture script, Socher remarks on the difference between the two that in a regular recursive network, the interaction of the two input vectors is "implicit" only (through the non-linearity), while in tensor networks the interaction between them is "multiplicative". I am not sure what he means by that. In the regular recursive network, the two $d$-dimensional input vectors are concatenated and then multiplied by a $[d \, \times \, 2d]$ matrix, resulting in a $d$-dimensional vector (which can again be used as input for the next tree node). In the tensor network above, the concatenated vector is multiplied on both sides of a $[2d \, \times \, 2d]$ slice (a bilinear form), resulting in a scalar – and we then "reconstruct" the original vector size by using $d$ such tensor "slices" (I try to make this concrete in the sketch after this list). I fail to see how in the first case the vectors "interact" less than in the second case – could anyone give me the intuition here?

  4. Finally, a natural way of thinking about the learning power of networks seems to be in terms of geometric linear transformations, such as rotation or reflection. My formal understanding of tensors is limited, especially of the tensor "slices" approach taken by Socher. As a result, I don't know whether this tensor approach allows for more "complex" transformations than those possible in a regular recursive network (using only a single matrix). More specifically, it seems that because of the additional "slices" we have more dimensions to work with (as mentioned above), but that the transformations are still essentially the same as in the single-matrix case, i.e. linear transformations. Is this view correct?
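
To make questions 2 and 3 more concrete, here is a small NumPy sketch of my current reading of the parameter counts and of how one slice's bilinear form expands into pairwise products (the values and names are my own; please correct it if it misrepresents the model):

```python
import numpy as np

d = 4
b, c = np.random.randn(d), np.random.randn(d)
bc = np.concatenate([b, c])          # [b; c], shape (2d,)

# question 2: parameter counts of the two composition functions
W_params = d * 2 * d                 # standard matrix W, d x 2d      -> 2d^2
V_params = d * (2 * d) ** 2          # tensor V, d slices of 2d x 2d  -> 4d^3

# question 3: one slice's bilinear form is a weighted sum of ALL pairwise
# products of the entries of [b; c], including cross terms b_j * c_k ...
Vi = np.random.randn(2 * d, 2 * d)   # a single slice V^{[i]}
scalar = bc @ Vi @ bc
expanded = sum(Vi[j, k] * bc[j] * bc[k]
               for j in range(2 * d) for k in range(2 * d))
assert np.isclose(scalar, expanded)

# ... whereas the standard term W [b; c] only ever adds weighted entries of
# b and c; it never multiplies an entry of b with an entry of c.
W = np.random.randn(d, 2 * d)
linear = W @ bc
```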

Thank you for reading this (rather long) question. Any remarks or corrections would be appreciated.

Best Answer

Even though neural networks are theoretically capable of learning any function if given arbitrarily many units, in practice, they have limited flexibility. And in the case of this paper, I believe the network is only as deep as the tree being parsed, which is to say, not very deep at all, so it has even more limited ability to approximate arbitrary functions.

Given this, we want to make the job as easy as possible. If the neural network is to approximate a quadratic function, we should give it quadratic units instead of linear units, so that it has a better chance of succeeding.

Why should we expect quadratic interactions between input variables to help? The argument is that the meanings of words often interact in multiplicative ways. For example, suppose "sad" has a meaning which is represented by the scalar $-3.1$. Then, we might expect that the word "very" in "very sad" would perform some sort of multiplicative effect on the value of "sad". For example, "very sad" might be $-6.2$, which would be computed as the product of the multiplier of "very" ($2$) and the value of "sad" ($-3.1$). Similar reasoning applies to words like "not", "slightly", and "because".

The composition layer of the RNTN computes $d$ of these quadratic functions (one per tensor slice), each returning a scalar, and then concatenates all the results into a vector.
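
As a toy illustration (the one-dimensional "meanings" and the particular slice are made up for this answer, not taken from the paper), a single slice whose only non-zero weight sits on the cross term between the two children computes exactly this kind of product:

```python
import numpy as np

# toy one-dimensional "meanings", made up for illustration
very = np.array([2.0])             # multiplier associated with "very"
sad = np.array([-3.1])             # value associated with "sad"
bc = np.concatenate([very, sad])   # [b; c]

# a single slice whose only non-zero entry weights the cross term very * sad
Vi = np.array([[0.0, 1.0],
               [0.0, 0.0]])

print(bc @ Vi @ bc)                # 2.0 * -3.1 = -6.2, the value for "very sad"
```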

How does this compare to the usual concatenation and nonlinearity? In the usual case, the output is $$f \left( W \begin{bmatrix} b \\ c \end{bmatrix} \right) = f(W_b b + W_c c)$$

We take a linear transformation of each input, add them, and then apply a nonlinearity. Everything before the nonlinearity is purely additive, so it would be pretty challenging to produce multiplicative interactions like the one I described above.
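
For concreteness, here is a quick NumPy check (with arbitrary values of my own) that the pre-activation of the standard layer splits into these two additive pieces, so nothing before the nonlinearity ever multiplies an entry of one child with an entry of the other:

```python
import numpy as np

d = 4
b, c = np.random.randn(d), np.random.randn(d)
W = np.random.randn(d, 2 * d)

# split W into the block acting on b and the block acting on c:
# the pre-activation W [b; c] is purely additive in the two children
W_b, W_c = W[:, :d], W[:, d:]
lhs = W @ np.concatenate([b, c])
rhs = W_b @ b + W_c @ c
assert np.allclose(lhs, rhs)
```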

It is true that this tensor model has many more parameters and is much more vulnerable to overfitting, but I suppose with adequate regularization it is not a huge problem.
