I believe that the key to answering this question is to point out that the element-wise multiplication is actually **shorthand** and therefore when you derive the equations you *never* actually use it.

The actual operation is not an element-wise multiplication but instead a standard matrix multiplication of a gradient with a Jacobian, **always**.

In the case of the nonlinearity, the Jacobian of the nonlinearity's vector output with respect to its vector input happens to be a diagonal matrix. Multiplying the gradient by this matrix is therefore equivalent to taking the gradient of the loss with respect to the output of the nonlinearity and multiplying it element-wise by a vector containing the partial derivatives of the nonlinearity with respect to each of its inputs, but this *follows* from the Jacobian being diagonal. You must pass through the Jacobian step to get to the element-wise multiplication, which might explain your confusion.

In math, we have some nonlinearity $s$, a loss $L$, and an input to the nonlinearity $x \in \mathbb{R}^{n \times 1}$ (this could be any tensor). The output of the nonlinearity has the same dimension, $s(x) \in \mathbb{R}^{n \times 1}$; as @Logan says, activation functions are defined element-wise.

We want $$\nabla_{x}L=\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L$$

where $\dfrac{\partial s(x)}{\partial x}$ is the Jacobian of $s$. Expanding this Jacobian, we get
$$\begin{bmatrix}
\dfrac{\partial{s(x_{1})}}{\partial{x_1}} & \dots & \dfrac{\partial{s(x_{1})}}{\partial{x_{n}}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial{s(x_{n})}}{\partial{x_{1}}} & \dots & \dfrac{\partial{s(x_{n})}}{\partial{x_{n}}}
\end{bmatrix}$$

Because $s$ acts element-wise, $s(x_{i})$ depends only on $x_{i}$, so the Jacobian is zero everywhere except on the diagonal. We can collect its diagonal elements into a vector $$Diag\left(\dfrac{\partial s(x)}{\partial x}\right)$$

And then use the element-wise operator.

$$\nabla_{x}L
=\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L
=Diag\left(\dfrac{\partial s(x)}{\partial x}\right) \circ \nabla_{s(x)}L$$
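To make the equivalence concrete, here is a minimal numerical sketch. It assumes a sigmoid as the nonlinearity $s$ and uses made-up values for $x$ and for the upstream gradient $\nabla_{s(x)}L$; the helper names (`jacobian`, `upstream`, and so on) are illustrative and not from any library. It builds the full Jacobian by finite differences, multiplies its transpose by the upstream gradient, and compares the result with the element-wise shortcut.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def s(x):
    # Element-wise nonlinearity: output i depends only on input i.
    return [sigmoid(v) for v in x]

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian, J[i][j] = d f(x)[i] / d x[j]."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(n):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return J

x = [0.5, -1.0, 2.0]            # assumed input to the nonlinearity
upstream = [0.1, -0.2, 0.3]     # assumed stand-in for the gradient of L w.r.t. s(x)

J = jacobian(s, x)

# The "real" operation: transpose of the Jacobian times the upstream gradient.
full = [sum(J[i][j] * upstream[i] for i in range(len(x))) for j in range(len(x))]

# The shorthand: diagonal of the Jacobian, element-wise times the upstream gradient.
diag = [sigmoid(v) * (1.0 - sigmoid(v)) for v in x]
shortcut = [d * g for d, g in zip(diag, upstream)]

print(full)
print(shortcut)
```

The two printed vectors should agree up to finite-difference error, and every off-diagonal entry of `J` comes out as zero, which is exactly why the shorthand is valid.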

I think the paper defines the joint distribution (not the conditional distribution!) as

$$p_{ij} = \frac{\exp(-||x_{i} - x_{j}||^{2}/2\sigma^{2})}{\sum_{k \neq l}{\exp(-||x_{k} - x_{l}||^{2}/2\sigma^{2})}},$$

**but they do not use it** and instead define $$p_{ij}=\frac{p_{j|i}+p_{i|j}}{2n},$$ where $n$ is the number of data points, so that the $p_{ij}$ sum to 1.

As mentioned in the paper, the original SNE and t-SNE differ in two respects:

> The cost function used by t-SNE differs from the one used by SNE in two ways: (1) it uses a symmetrized version of the SNE cost function with simpler gradients that was briefly introduced by Cook et al. (2007) and (2) it uses a Student-t distribution rather than a Gaussian to compute the similarity between two points in the low-dimensional space. t-SNE employs a heavy-tailed distribution in the low-dimensional space to alleviate both the crowding problem and the optimization problems of SNE.

**Update based on the question edit**: The denominator in both cases is just the normalization that ensures $\sum_{j} p_{j|i} = 1$ for each $i$ and $\sum_{i,j} p_{ij} = 1$, the basic requirement for both to be probability distributions.

Also, since there is only one Gaussian here, we take $\sigma$ as its standard deviation. In the first case there is one Gaussian per point $i$; we could have used a common standard deviation, but instead $\sigma_i$ is chosen to depend on the density of neighbors around the point. If a point has a large number of neighbors within a given distance, its conditional distribution should drop off faster than the conditional distribution of a point in a sparser region.
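As a rough illustration of the two normalizations, the sketch below computes the conditional probabilities $p_{j|i}$ with a separate $\sigma_i$ per point and then symmetrizes them, dividing by $2n$ as the t-SNE paper does. The data points and the `sigmas` values are made up by hand for this example (the actual algorithm finds each $\sigma_i$ by a search over a target perplexity, which is not shown), and the function names are hypothetical.

```python
import math

def conditional_p(X, sigmas):
    """P[i][j] = p_{j|i}: Gaussian centred on X[i] with its own sigma_i,
    normalised over all j != i so that each row sums to 1."""
    n = len(X)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
                P[i][j] = math.exp(-d2 / (2.0 * sigmas[i] ** 2))
        row_sum = sum(P[i])
        P[i] = [v / row_sum for v in P[i]]
    return P

def symmetrized_joint(P_cond):
    """p_ij = (p_{j|i} + p_{i|j}) / (2n), so the whole matrix sums to 1."""
    n = len(P_cond)
    return [[(P_cond[i][j] + P_cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]

# Three points in a dense cluster and one isolated point (assumed toy data).
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]]
# Hand-picked for illustration: small sigma in the dense region, large in the sparse one.
sigmas = [0.1, 0.1, 0.1, 2.0]

P_cond = conditional_p(X, sigmas)
P_joint = symmetrized_joint(P_cond)

print([sum(row) for row in P_cond])          # each conditional row sums to 1
print(sum(sum(row) for row in P_joint))      # the joint sums to 1 over all pairs
```

Note that `P_cond` is generally not symmetric because each row uses its own $\sigma_i$, whereas `P_joint` is symmetric by construction.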

## Best Answer

I just signed up for this forum due to your question :)

Nice question! It shows someone is indeed trying to follow and derive the nitty-gritty. Your question is totally valid: (28) is indeed missing the $d_{ij}$, but (24) is in turn missing a $d_{ij}^{-1}$. You can see that from (21) via $\frac{\partial d_{ij}}{\partial y_i}$, taking into account that $$\frac{\partial \lVert\mathbf x\rVert}{\partial x_i} = \frac{x_i}{\lVert\mathbf x\rVert}.$$

So at the end everything is correct again! :P
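If you want to convince yourself of the norm derivative used above, here is a small finite-difference check. It assumes $d_{ij} = \lVert y_i - y_j \rVert$ (the Euclidean distance between two low-dimensional points), for which the identity gives $\partial d_{ij} / \partial y_i = (y_i - y_j)/d_{ij}$; the point values and function names are made up for illustration.

```python
import math

def dist(y_i, y_j):
    # d_ij = ||y_i - y_j|| (Euclidean norm, not squared)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_i, y_j)))

def analytic_grad(y_i, y_j):
    # d d_ij / d y_i = (y_i - y_j) / d_ij, i.e. the d_ij^{-1} factor
    d = dist(y_i, y_j)
    return [(a - b) / d for a, b in zip(y_i, y_j)]

def numeric_grad(y_i, y_j, eps=1e-6):
    # Central finite differences with respect to each coordinate of y_i
    grad = []
    for k in range(len(y_i)):
        y_plus, y_minus = list(y_i), list(y_i)
        y_plus[k] += eps
        y_minus[k] -= eps
        grad.append((dist(y_plus, y_j) - dist(y_minus, y_j)) / (2.0 * eps))
    return grad

y_i = [1.0, 2.0]     # assumed example points
y_j = [-0.5, 0.5]

g_analytic = analytic_grad(y_i, y_j)
g_numeric = numeric_grad(y_i, y_j)
print(g_analytic)
print(g_numeric)
```

The two gradients should match to within finite-difference error, which confirms where the $d_{ij}^{-1}$ factor comes from.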