I believe the key to answering this question is to point out that the element-wise multiplication is actually shorthand, and that when you derive the equations you never actually use it.
The actual operation is always a standard matrix multiplication of a gradient with a Jacobian, not an element-wise multiplication.
In the case of the nonlinearity, the Jacobian of its vector output with respect to its vector input happens to be a diagonal matrix. It is therefore true that multiplying the gradient by this matrix is equivalent to element-wise multiplying the gradient of the loss with respect to the nonlinearity's output by a vector containing the partial derivatives of the nonlinearity with respect to its input, but this equivalence follows from the Jacobian being diagonal. You must pass through the Jacobian step to get to the element-wise multiplication, which might explain your confusion.
In math, we have some nonlinearity $s$, a loss $L$, and an input to the nonlinearity $x \in \mathbb{R}^{n \times 1}$ (this could be any tensor). The output of the nonlinearity has the same dimension, $s(x) \in \mathbb{R}^{n \times 1}$, since, as @Logan says, the activation function is defined element-wise.
We want $$\nabla_{x}L=\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L$$
where $\dfrac{\partial s(x)}{\partial x}$ is the Jacobian of $s$. Expanding this Jacobian, we get
$$\begin{bmatrix}
\dfrac{\partial{s(x_{1})}}{\partial{x_{1}}} & \dots & \dfrac{\partial{s(x_{1})}}{\partial{x_{n}}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial{s(x_{n})}}{\partial{x_{1}}} & \dots & \dfrac{\partial{s(x_{n})}}{\partial{x_{n}}}
\end{bmatrix}$$
We see that it is zero everywhere except on the diagonal. We can make a vector of all its diagonal elements $$Diag\left(\dfrac{\partial s(x)}{\partial x}\right)$$
and then use the element-wise operator:
$$\nabla_{x}L
=\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L
=Diag\left(\dfrac{\partial s(x)}{\partial x}\right) \circ \nabla_{s(x)}L$$
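If it helps to see this numerically, here is a minimal NumPy check that the full Jacobian-vector product and the element-wise shorthand give the same gradient. The sigmoid is just an arbitrary choice for $s$, and all names here are illustrative, not from any particular framework:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)          # input to the nonlinearity
grad_sx = rng.normal(size=n)    # upstream gradient dL/ds(x), arbitrary for this check

s_prime = sigmoid(x) * (1.0 - sigmoid(x))   # element-wise derivative s'(x_i)
J = np.diag(s_prime)                        # Jacobian of s(x) w.r.t. x is diagonal

grad_x_jacobian = J.T @ grad_sx             # full Jacobian-vector product
grad_x_elementwise = s_prime * grad_sx      # shorthand: Diag of the Jacobian times dL/ds(x)

assert np.allclose(grad_x_jacobian, grad_x_elementwise)
```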
I realized what may be missing is the number of filters in the layer.
Even though they don't have a symbol for it in the table, the authors might be implicitly assuming that the number of filters has the same order of magnitude as the depth $d$, or, even more simply, that the number of filters is equal to $d$ (in that case, the conv layer does not change the depth dimensionality).
So, in that case, the time complexity indeed amounts to $\mathcal{O}(k\cdot n \cdot d^2)$ because we're repeating the $\mathcal{O}(k\cdot n \cdot d)$ routine described in the question for each of the $d$ filters.
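As a rough sanity check, here is a back-of-the-envelope multiplication count mirroring that argument. The symbols $k$, $n$, $d$ follow the question; additions and constant factors are ignored, and the number of filters is assumed to equal $d$, as discussed above:

```python
# Rough multiplication count for one 1-D convolutional layer.
k, n, d = 3, 100, 64                 # kernel width, sequence length, depth

mults_per_output = k * d             # each filter spans k positions times d channels
outputs_per_filter = n               # one output per sequence position (length-preserving conv)
num_filters = d                      # assumed equal to the input depth

total_mults = mults_per_output * outputs_per_filter * num_filters
print(total_mults)                   # equals k * n * d * d, i.e. O(k * n * d^2)
```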
Multiplying two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ requires $mnp$ scalar multiplications and $m(n-1)p$ additions, for a total of $mp(2n-1)$ scalar operations.
In your first case you perform two matrix multiplications, which results in $d_1(2n-1)+d_2(2n-1)=(d_1+d_2)(2n-1)$ operations. In the second case you perform only one matrix multiplication, but the involved matrices are bigger, so there are $(d_1+d_2)(4n-1)$ operations. Roughly speaking, and counting only the computational steps, the first way is about twice as fast.

However, it requires concatenation, and the decision of which way to choose depends on the cost of concatenating two vectors. Without knowing how your NN is implemented or which programming language you use, I can say that concatenation is in general of linear complexity ($O(n)$), and that copying $n$ variables from one array to another should not be as expensive as performing mathematical operations on them, so you should probably go for the first way.

If you can, though, try to avoid allocating new memory when concatenating. It is wiser to allocate space for one array of size $2n$ and simply copy the $n$ new elements into its second half than to allocate memory once for an array of size $n$ and a second time for an array of size $2n$. Also, if you control allocation, avoid allocating memory for the concatenation in each step of your NN: allocation requires calls to the operating system, and it slows your program down. Take the memory you need before the NN starts and make sure it is reused properly. The C programming language gives you this level of control, but even if these details are out of your control, the first way should still be faster.
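For illustration, here is a minimal NumPy sketch of the preallocation idea; the exact shapes from your setup are not reproduced, and `concat_into` is just a made-up helper name:

```python
import numpy as np

n = 256
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=n), rng.normal(size=n)

# Allocate the size-2n buffer once, before the network starts, and reuse it
# at every step instead of allocating a fresh array for each concatenation.
x_cat = np.empty(2 * n)

def concat_into(buf, a, b):
    # Copy the two vectors into the halves of the preallocated buffer.
    buf[: a.shape[0]] = a
    buf[a.shape[0]:] = b
    return buf

concat_into(x_cat, x1, x2)
```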
The zero-valued weights would change during training, but you can always postprocess this super-embedding matrix and ensure that it contains zeros in the appropriate positions.
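As a sketch of what that postprocessing could look like (the names `embedding` and `zero_mask` are hypothetical, and the mask pattern here is purely illustrative):

```python
import numpy as np

vocab_size, dim = 1000, 64
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, dim))   # stands in for the trained super-embedding

# zero_mask marks the positions that are supposed to stay zero.
zero_mask = np.zeros((vocab_size, dim), dtype=bool)
zero_mask[:, dim // 2:] = True                   # example pattern only

embedding[zero_mask] = 0.0                       # restore the zeros after training has perturbed them
```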