With the distributions on our random vectors:
$\mathbf x_i \mid \mu \sim N(\mu, \mathbf \Sigma)$
$\mu \sim N(\mu_0, \mathbf \Sigma_0)$
By Bayes's rule the posterior distribution looks like:
$p(\mu| \{\mathbf x_i\}) \propto p(\mu) \prod_{i=1}^N p(\mathbf x_i | \mu)$
So:
$\ln p(\mu| \{\mathbf x_i\}) = -\frac{1}{2}\sum_{i=1}^N(\mathbf x_i - \mu)'\mathbf \Sigma^{-1}(\mathbf x_i - \mu) -\frac{1}{2}(\mu - \mu_0)'\mathbf \Sigma_0^{-1}(\mu - \mu_0) + const$
$ = -\frac{1}{2} N \mu' \mathbf \Sigma^{-1} \mu + \sum_{i=1}^N \mu' \mathbf \Sigma^{-1} \mathbf x_i -\frac{1}{2} \mu' \mathbf \Sigma_0^{-1} \mu + \mu' \mathbf \Sigma_0^{-1} \mu_0 + const$
$ = -\frac{1}{2} \mu' (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1}) \mu + \mu' (\mathbf \Sigma_0^{-1} \mu_0 + \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i) + const$
$= -\frac{1}{2}(\mu - (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1}(\mathbf \Sigma_0^{-1} \mu_0 + \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i))' (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1}) (\mu - (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1}(\mathbf \Sigma_0^{-1} \mu_0 + \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i)) + const$
Which is the log density of a Gaussian:
$\mu| \{\mathbf x_i\} \sim N((N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1}(\mathbf \Sigma_0^{-1} \mu_0 + \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i), (N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1})$
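As a sanity check, the posterior parameters can be computed directly from this formula. This is a minimal sketch with made-up values for $\mu_0$, $\mathbf \Sigma_0$, and $\mathbf \Sigma$ (the dimensions and parameter values are illustrative, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 50

# Illustrative prior and likelihood parameters.
mu0 = np.zeros(d)
Sigma0 = np.eye(d)
Sigma = 0.5 * np.eye(d) + 0.1 * np.ones((d, d))

# Simulated data x_i ~ N(mu, Sigma) for some true mu.
x = rng.multivariate_normal(rng.normal(size=d), Sigma, size=N)

# Posterior precision, covariance, and mean, straight from the derivation.
post_prec = N * np.linalg.inv(Sigma) + np.linalg.inv(Sigma0)
post_cov = np.linalg.inv(post_prec)
post_mean = post_cov @ (np.linalg.inv(Sigma0) @ mu0
                        + np.linalg.inv(Sigma) @ x.sum(axis=0))
```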
Using the Woodbury identity on our expression for the covariance matrix:
$(N \mathbf \Sigma^{-1} + \mathbf \Sigma_0^{-1})^{-1} = \mathbf \Sigma(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \frac{1}{N} \mathbf \Sigma_0$
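This identity is easy to verify numerically. A quick sketch, using random symmetric positive definite matrices as stand-ins for $\mathbf \Sigma$ and $\mathbf \Sigma_0$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 7

# Random SPD matrices standing in for Sigma and Sigma0.
A = rng.normal(size=(d, d)); Sigma = A @ A.T + d * np.eye(d)
B = rng.normal(size=(d, d)); Sigma0 = B @ B.T + d * np.eye(d)

# Left side: direct inverse of the posterior precision.
lhs = np.linalg.inv(N * np.linalg.inv(Sigma) + np.linalg.inv(Sigma0))
# Right side: the Woodbury rearrangement from above.
rhs = Sigma @ np.linalg.inv(Sigma / N + Sigma0) @ (Sigma0 / N)
assert np.allclose(lhs, rhs)
```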
Which provides the covariance matrix in the form the OP wanted. Using this expression (and its symmetry) further in the expression for the mean we have:
$\mathbf \Sigma(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \frac{1}{N} \mathbf \Sigma_0 \mathbf \Sigma_0^{-1} \mu_0 + \frac{1}{N} \mathbf \Sigma_0(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \mathbf \Sigma \mathbf \Sigma^{-1} \sum_{i=1}^N \mathbf x_i$
$= \mathbf \Sigma(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \frac{1}{N} \mu_0 + \mathbf \Sigma_0(\frac{1}{N} \mathbf \Sigma + \mathbf \Sigma_0)^{-1} \sum_{i=1}^N (\frac{1}{N} \mathbf x_i)$
Which is the form required by the OP for the mean.
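The equivalence of the two expressions for the mean can also be checked numerically. A short sketch with random SPD matrices and random data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 3, 10
A = rng.normal(size=(d, d)); Sigma = A @ A.T + d * np.eye(d)
B = rng.normal(size=(d, d)); Sigma0 = B @ B.T + d * np.eye(d)
mu0 = rng.normal(size=d)
x = rng.normal(size=(N, d))

# Direct form of the posterior mean.
direct = np.linalg.inv(N * np.linalg.inv(Sigma) + np.linalg.inv(Sigma0)) @ (
    np.linalg.inv(Sigma0) @ mu0 + np.linalg.inv(Sigma) @ x.sum(axis=0))

# Rearranged form: Sigma M (mu0 / N) + Sigma0 M x_bar, with M the shared inverse.
M = np.linalg.inv(Sigma / N + Sigma0)
rearranged = Sigma @ M @ (mu0 / N) + Sigma0 @ M @ x.mean(axis=0)
assert np.allclose(direct, rearranged)
```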
Even though neural networks are theoretically capable of learning any function if given arbitrarily many units, in practice they have limited flexibility. And in the case of this paper, I believe the network is only as deep as the tree being parsed, which is to say not very deep at all, so its ability to approximate arbitrary functions is even more limited.
Given this, we want to make the job as easy as possible. If the network needs to approximate a quadratic function, we should give it quadratic units rather than only linear ones, so that it has a better chance of succeeding.
Why should we expect quadratic interactions between input variables to help? The argument is that the meanings of words often interact in multiplicative ways. For example, suppose "sad" has a meaning represented by the scalar $-3.1$. Then we might expect the word "very" in "very sad" to have some sort of multiplicative effect on the value of "sad". For example, "very sad" might be $-6.2$, computed as the product of the multiplier of "very", $2$, and the value of "sad", $-3.1$. Similar reasoning applies to words like "not", "slightly", and "because".
The composition layer of the RNTN computes many of these quadratic functions, each returning a scalar, and then concatenates all the results into a vector.
How does this compare to the usual concatenation and nonlinearity? In the usual case, the output is $$f \left( W \begin{bmatrix} a \\ b \end{bmatrix} \right) = f(W_a a + W_bb)$$
We take a linear transformation of each input, add them, and then apply a nonlinearity. Everything is essentially additive, so it would be quite difficult to express the kind of multiplicative transformations I described before.
It is true that this tensor model has many more parameters and is much more vulnerable to overfitting, but I suppose with adequate regularization it is not a huge problem.
Suppose you have a set of $d$ nodes with linear activation and weights $b_{ij}$ for input $j$ to node $i$. If you can impose the constraints $\sum_j b_{ij}^2=1$ and $\sum_{j} b_{ij}b_{kj}=0$ for $i\neq k$, then the mapping from input to output is multiplication by an orthonormal matrix.
The orthonormal matrices form two connected sets: the rotations (determinant $=1$) and the rotations with reflection (determinant $=-1$). If your learning rate isn't too high, the transformation won't be able to jump between these components, so if you start your weights off as a rotation they'll stay a rotation.
This assumes you want rotations around the origin. To get rotations around some other point, you need non-zero intercept (bias) terms chosen to move that point to the origin, rotate, and then move it back.
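One practical way to keep the constraints satisfied, sketched below, is to project the weight matrix onto the nearest rotation after each update via an SVD, with a sign correction to stay on the determinant $+1$ component. This projection step is my own suggestion, not something prescribed in the answer above; the shift-rotate-shift trick for rotating about a point $c$ is also shown:

```python
import numpy as np

def nearest_rotation(B):
    # Project B onto the nearest rotation matrix (det = +1) via SVD.
    U, _, Vt = np.linalg.svd(B)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        # Flip the least-significant singular direction to avoid a reflection.
        U[:, -1] *= -1
        R = U @ Vt
    return R

rng = np.random.default_rng(4)
R = nearest_rotation(rng.normal(size=(3, 3)))

# Rotation about a point c: move c to the origin, rotate, move it back.
c = np.array([1.0, 2.0, 3.0])
def rotate_about(x, R, c):
    return R @ (x - c) + c
```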