Solved – the “expressive power” of the composition function in a Recursive Neural Tensor Network

deep learning, linear algebra, machine learning, neural networks, recurrent neural network

I wasn't entirely sure if this question belongs here or on math.stackexchange, since it's only partially about machine learning and perhaps more of a conceptual question about linear algebra (or tensor algebra). Please let me know if you believe this question is better asked on another SE site.


The type of network I am reading about is described for example here:

Socher et al: Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

As far as I currently understand it, the basic network architecture is very similar to a recursive neural network, i.e. a recurrent neural network in which the unfolded network for a given finite input forms a (usually binary) tree instead of a "flat" chain (as in the recurrent case).

The difference then rests in the composition function at each of the tree's nodes. The parent vector $p$ of two children vectors (either input words, or parent vectors themselves), is calculated as follows:

$p = f(\begin{bmatrix} b \\ c \end{bmatrix}^T V^{[1:d]} \begin{bmatrix} b \\ c \end{bmatrix} + W\begin{bmatrix} b \\ c \end{bmatrix})$

Here, $f$ is a nonlinearity like tanh, $b$ and $c$ are $d$-dimensional children vectors, $V^{[1:d]}$ is the tensor made up of $d$ 'slices' $V^{[i]} \in \mathbb{R}^{2d \times 2d}$ (the authors' terminology), each slice contributing one entry of $p$, and $W\begin{bmatrix} b \\ c \end{bmatrix}$ is the usual standard-layer multiplication (as in a non-tensor network) added at the end.
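
To check that I am reading the formula correctly, here is a minimal NumPy sketch of the composition as I understand it (the dimensions, variable names, and random parameters are my own, purely for illustration):

```python
import numpy as np

d = 4                                 # dimensionality of each word/phrase vector
b = np.random.randn(d)                # left child
c = np.random.randn(d)                # right child
bc = np.concatenate([b, c])           # [b; c], shape (2d,)

V = np.random.randn(d, 2 * d, 2 * d)  # tensor: d slices, each 2d x 2d
W = np.random.randn(d, 2 * d)         # standard composition matrix

# tensor term: one scalar per slice, [b; c]^T V^{[i]} [b; c]
bilinear = np.array([bc @ V[i] @ bc for i in range(d)])

# standard term: W [b; c]
standard = W @ bc

p = np.tanh(bilinear + standard)      # parent vector, shape (d,)
```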

Please correct me if I made any mistakes so far.

My question is about what exactly changes, in terms of "expressive power" and the network's ability to learn, when we go from a single weight matrix to "slices of a tensor" as above.

I'll try to break down my uncertainty about this approach into smaller partial questions:

  1. In principle, the units (of a standard multi-layer ANN just as well as those of a "tree-shaped" RNN) are already able to capture any function we are interested in. Given a sufficient number of hidden units, a network can approximate a non-linear function as a linear combination of the non-linearity (like tanh), provided the non-linearity satisfies the conditions of the universal approximation theorem. Under this view, I am wondering what the additional "power" of the tensor-based units is?

  2. One difference that I can see is that the $d$ "slices" of the tensor are essentially a way to increase the number of parameters per unit (and in the network in total), while leaving the input dimensionality unchanged. Instead of a single $[d \, \times \, 2d]$ matrix, our units now include $d$ copies of a $[2d \, \times \, 2d]$ matrix (the slices of $V$ above). To me this looks like the network simply has more parameters for learning from the same input – but, as a trade-off, this could also increase the risk of overfitting. Correct?

  3. In a lecture script, Socher remarks on the difference between the two that in a regular recursive network, the interaction of the two input vectors is "implicit" only (through the non-linearity), while in tensor networks the interaction between them is "multiplicative". I am not sure what he means by that. In the regular recursive network, the two $d$-dimensional input vectors are concatenated and then multiplied by a $[d \, \times \, 2d]$ matrix, resulting in a $d$-dimensional vector (which can again be used as input for the next tree node). In the tensor network above, the concatenated vector is multiplied on both sides of a $[2d \, \times \, 2d]$ slice (a bilinear form), resulting in a scalar – and we then "reconstruct" the original vector size by using $d$ such tensor "slices" (I try to make this concrete in the sketch after this list). I fail to see how in the first case the vectors "interact" less than in the second case – could anyone give me the intuition here?

  4. Finally, a natural way of thinking about the learning power of networks seems to be in terms of geometric linear transformations, such as rotation or reflection. My formal understanding of tensors is limited, especially of the tensor "slices" approach taken by Socher. As a result, I don't know whether this tensor approach allows for more "complex" transformations than those possible in a regular recursive network (using only a single matrix). More specifically, it seems that because of the additional "slices" we have more dimensions to work with (as mentioned above), but that the transformations are still essentially the same as in the single-matrix case, i.e. linear transformations. Is this view correct?
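
To make questions 2 and 3 more concrete, here is a small NumPy sketch of my current reading of the parameter counts and of how one slice's bilinear form expands into pairwise products (the values and names are my own; please correct it if it misrepresents the model):

```python
import numpy as np

d = 4
b, c = np.random.randn(d), np.random.randn(d)
bc = np.concatenate([b, c])          # [b; c], shape (2d,)

# question 2: parameter counts of the two composition functions
W_params = d * 2 * d                 # standard matrix W, d x 2d      -> 2d^2
V_params = d * (2 * d) ** 2          # tensor V, d slices of 2d x 2d  -> 4d^3

# question 3: one slice's bilinear form is a weighted sum of ALL pairwise
# products of the entries of [b; c], including cross terms b_j * c_k ...
Vi = np.random.randn(2 * d, 2 * d)   # a single slice V^{[i]}
scalar = bc @ Vi @ bc
expanded = sum(Vi[j, k] * bc[j] * bc[k]
               for j in range(2 * d) for k in range(2 * d))
assert np.isclose(scalar, expanded)

# ... whereas the standard term W [b; c] only ever adds weighted entries of
# b and c; it never multiplies an entry of b with an entry of c.
W = np.random.randn(d, 2 * d)
linear = W @ bc
```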

Thank you for reading this (rather long) question. Any remarks or corrections would be appreciated.

Best Answer

Even though neural networks are theoretically capable of learning any function if given arbitrarily many units, in practice, they have limited flexibility. And in the case of this paper, I believe the network is only as deep as the tree being parsed, which is to say, not very deep at all, so it has even more limited ability to approximate arbitrary functions.

Given this, we want to make the job as easy as possible. If the neural network is to approximate a quadratic function, we should give it quadratic units instead of linear units, so that it has a better chance of succeeding.

Why should we expect quadratic interactions between input variables to help? The argument is that the meanings of words often interact in multiplicative ways. For example, suppose "sad" has a meaning which is represented by the scalar $-3.1$. Then, we might expect that the word "very" in "very sad" would perform some sort of multiplicative effect on the value of "sad". For example, "very sad" might be $-6.2$, which would be computed as the product of the multiplier of "very" ($2$) and the value of "sad" ($-3.1$). Similar reasoning applies to words like "not", "slightly", and "because".

The composition layer of the RNTN computes $d$ of these quadratic functions (one per tensor slice), each returning a scalar, and then concatenates all the results into a vector.
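
As a toy illustration (the one-dimensional "meanings" and the particular slice are made up for this answer, not taken from the paper), a single slice whose only non-zero weight sits on the cross term between the two children computes exactly this kind of product:

```python
import numpy as np

# toy one-dimensional "meanings", made up for illustration
very = np.array([2.0])             # multiplier associated with "very"
sad = np.array([-3.1])             # value associated with "sad"
bc = np.concatenate([very, sad])   # [b; c]

# a single slice whose only non-zero entry weights the cross term very * sad
Vi = np.array([[0.0, 1.0],
               [0.0, 0.0]])

print(bc @ Vi @ bc)                # 2.0 * -3.1 = -6.2, the value for "very sad"
```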

How does this compare to the usual concatenation and nonlinearity? In the usual case, the output is $$f \left( W \begin{bmatrix} b \\ c \end{bmatrix} \right) = f(W_b b + W_c c)$$

We take a linear transformation of each input, add them, and then apply a nonlinearity. Everything before the nonlinearity is purely additive, so it would be pretty challenging to produce multiplicative interactions like the one I described above.
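
For concreteness, here is a quick NumPy check (with arbitrary values of my own) that the pre-activation of the standard layer splits into these two additive pieces, so nothing before the nonlinearity ever multiplies an entry of one child with an entry of the other:

```python
import numpy as np

d = 4
b, c = np.random.randn(d), np.random.randn(d)
W = np.random.randn(d, 2 * d)

# split W into the block acting on b and the block acting on c:
# the pre-activation W [b; c] is purely additive in the two children
W_b, W_c = W[:, :d], W[:, d:]
lhs = W @ np.concatenate([b, c])
rhs = W_b @ b + W_c @ c
assert np.allclose(lhs, rhs)
```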

It is true that this tensor model has many more parameters and is much more vulnerable to overfitting, but I suppose with adequate regularization it is not a huge problem.
