Solved – Concatenation of weights in Neural Networks

backpropagation, natural language, neural networks, weights, word embeddings

I'm training a special neural network for Natural Language Processing which has an embedding layer (this layer takes the one-hot vector of each word and outputs its embedding vector through an embedding matrix). What I want is to use two embedding matrices for the same input. Formally, if we have an input $x \in \mathbb{R}^{1 \times n}$ and two embedding matrices $w_1 \in \mathbb{R}^{n \times d_1}$ and $w_2 \in \mathbb{R}^{n \times d_2}$, I want an output $e \in \mathbb{R}^{1 \times (d_1+d_2)}$ that contains the two embedding vectors. To do so, we can compute $x \cdot w_1$ and concatenate it with $x \cdot w_2$, so the NN can learn the two sets of weights separately. But what I thought is that I could instead build a "super"-embedding matrix in $\mathbb{R}^{2n \times (d_1+d_2)}$, filled in with zeros as $$\begin{bmatrix} w_1 & 0\\ 0 & w_2 \end{bmatrix},$$ and take its dot product with the concatenation $[x;\, x]$. In the end I get the same result.
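Here is a minimal NumPy sketch of what I mean (the sizes and array names are just for illustration):

```python
import numpy as np

n, d1, d2 = 5, 3, 4
rng = np.random.default_rng(0)

# one-hot input vector (1 x n) and two embedding matrices
x = np.zeros((1, n))
x[0, 2] = 1.0
w1 = rng.standard_normal((n, d1))
w2 = rng.standard_normal((n, d2))

# first way: two lookups, then concatenate the results
e_concat = np.concatenate([x @ w1, x @ w2], axis=1)    # shape (1, d1 + d2)

# second way: block-diagonal "super"-embedding matrix and a duplicated input
w_super = np.block([[w1, np.zeros((n, d2))],
                    [np.zeros((n, d1)), w2]])           # shape (2n, d1 + d2)
x_super = np.concatenate([x, x], axis=1)                # shape (1, 2n)
e_super = x_super @ w_super

assert np.allclose(e_concat, e_super)
```

Given this equivalence, I have two questions: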

  • Which one is more expensive computationally? The second matrix is bigger, but it requires only one dot product, and the concatenation happens before the layer.
  • Do the weights initialized with zeros change their values during the NN's learning phase?

Thanks in advance!

Best Answer

Multiplying two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ requires $mnp$ scalar multiplications and $m(n-1)p$ additions, for a total of $mp(2n-1)$ scalar operations.

In your first case you perform two matrix multiplications, which amounts to $d_1(2n-1) + d_2(2n-1) = (d_1+d_2)(2n-1)$ operations. In the second case you perform only one matrix multiplication, but the matrices involved are bigger, so it takes $(d_1+d_2)(4n-1)$ operations. Roughly speaking, and counting only the arithmetic, the first way is twice as fast.

However, the first way requires concatenation, so the choice also depends on how expensive it is to concatenate two vectors. Without knowing how your NN is implemented or which programming language you use, I can only say that concatenation is in general of linear complexity ($O(n)$), and that copying $n$ values from one array to another should not be as expensive as performing arithmetic on them, so you should probably go for the first way.

If you can, avoid allocating new memory when concatenating. It is wiser to allocate space for one array of size $2n$ and simply copy its first $n$ elements into its second half than to allocate memory once for an array of size $n$ and again for an array of size $2n$. Also, if you control allocation, avoid allocating memory for the concatenation in every step of your NN: allocation requires calls to the operating system and slows your program down. Take the memory you need before the NN starts and make sure it is reused. The C programming language gives you this level of control, but even if these details are out of your hands, the first way should still be faster.
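As a rough illustration of the preallocation idea for the first way (a NumPy sketch; the sizes and names are made up for illustration), you can reuse one output buffer across steps instead of building a new concatenated array each time:

```python
import numpy as np

n, d1, d2, steps = 10_000, 300, 200, 100
rng = np.random.default_rng(0)
w1 = rng.standard_normal((n, d1))
w2 = rng.standard_normal((n, d2))

# allocate the concatenated output buffer once, before the loop
e = np.empty((1, d1 + d2))

for _ in range(steps):
    x = np.zeros((1, n))
    x[0, rng.integers(n)] = 1.0
    # write each product directly into its slice of the preallocated buffer,
    # instead of allocating a fresh concatenated array at every step
    np.matmul(x, w1, out=e[:, :d1])
    np.matmul(x, w2, out=e[:, d1:])
```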

Yes, the weights that start at zero would change during training, since backpropagation by itself does nothing to keep them at zero; but you can always postprocess the super-embedding matrix and make sure it contains zeros in the appropriate positions.
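If you do keep the super-embedding matrix, one simple way to do that postprocessing is to multiply the matrix by a fixed 0/1 mask after each update, forcing the off-diagonal blocks back to zero (a NumPy sketch using the same illustrative sizes as above; in a deep-learning framework you would typically apply the same mask to the weight or its gradient):

```python
import numpy as np

n, d1, d2 = 5, 3, 4
rng = np.random.default_rng(0)

# block-diagonal super-embedding matrix
w_super = np.block([[rng.standard_normal((n, d1)), np.zeros((n, d2))],
                    [np.zeros((n, d1)), rng.standard_normal((n, d2))]])

# fixed 0/1 mask marking the entries that are allowed to be nonzero
mask = np.block([[np.ones((n, d1)), np.zeros((n, d2))],
                 [np.zeros((n, d1)), np.ones((n, d2))]])

def sgd_step(w, grad, lr=0.1):
    """Plain gradient step followed by re-zeroing the off-diagonal blocks."""
    w -= lr * grad
    w *= mask          # postprocess: keep zeros in the appropriate positions
    return w

grad = rng.standard_normal(w_super.shape)   # stand-in for a backprop gradient
w_super = sgd_step(w_super, grad)
assert np.all(w_super[:n, d1:] == 0) and np.all(w_super[n:, :d1] == 0)
```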
