Thinking in terms of directional derivatives might be more instructive in this case, as we can arrive at the chain rule formulation in a constructive fashion.
Let us consider the directional derivative of a multivariate function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ in an arbitrary direction $\textbf{u} \in \mathbb{R}^n$.
Since $\textbf{u}$ is a direction, we shall assume $\|\textbf{u}\| = 1$. The directional derivative in the direction of $\textbf{u}$ is then defined as
$$\nabla_u f = \nabla f \cdot \textbf{u}$$
This can easily be proved by noting that, under the usual limit definition of a derivative, we have
$$\nabla_u f (\textbf{x}) = \lim_{h\rightarrow0} \frac{f(\textbf{x} + h\textbf{u}) - f(\textbf{x})}{h }$$
Since we assume the objective function $f$ is differentiable, around any point $\textbf{a}$ we can find a linear approximation of $f$ that is close to the true value of $f(\textbf{x})$ in an $\epsilon$-neighborhood of $\textbf{a}$:
$$f(\textbf{x}) \approx f(\textbf{a}) + \nabla f(\textbf{a})^T (\textbf{x} - \textbf{a})$$
Plugging this approximation into the previous limit formulation (taking $\textbf{x} = \textbf{a} + h\textbf{u}$, so the $f(\textbf{a})$ terms cancel and the factor of $h$ divides out), we get, for any $\textbf{a}$,
$$\nabla_u f (\textbf{a}) = \nabla f(\textbf{a}) \cdot \textbf{u}$$
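To see this numerically, here is a quick sketch in Python; the particular $f$, $\textbf{a}$, and $\textbf{u}$ below are arbitrary choices of mine, not from the original problem. The finite-difference quotient from the limit definition matches $\nabla f(\textbf{a}) \cdot \textbf{u}$:

```python
import numpy as np

# f(x) = x0^2 + 3*x0*x1, with its gradient written out by hand.
f = lambda x: x[0]**2 + 3 * x[0] * x[1]
grad_f = lambda x: np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

a = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)   # normalize: u must be a unit direction

h = 1e-6
finite_diff = (f(a + h * u) - f(a)) / h   # the limit definition
dot_product = grad_f(a) @ u               # nabla f(a) . u

print(finite_diff, dot_product)           # agree to ~6 decimal places
```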
Going back to the original problem: when differentiating the cost function $E_n$ with respect to an arbitrary parameter $a_j$, we also need to take into account the perturbations that a change in $a_j$ induces in the other variables it interacts with. Specifically, when altering $a_j$ we also move along the **direction** of the perturbed variables.
This direction encapsulates all the changes caused in the intermediate variables $\{a_k\}_{k=1}^{n}$ when a certain $a_j$ is altered. As such, for every $a_k$ that is directly influenced by $a_j$, we can measure the actual change by evaluating $\frac{\partial a_k}{\partial a_j}$.
Let $\textbf{p}$ denote this direction vector; its components are
$$\textbf{p}_k = \frac{\partial a_k}{\partial a_j}$$
where $k$ runs through all variables $a_k$ that are directly influenced by $a_j$.
Coupling the directional derivative with the previously described concept, we arrive at the desired result, namely
$$\frac{\partial E_n}{\partial a_j} = \nabla E_n ^T \cdot \textbf p = \sum_{k \in \mathcal{S}} \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$$
where $\mathcal{S}$ is the set of all indices corresponding to variables directly influenced by $a_j$.
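As a sanity check of this formula, here is a toy example (my own construction, not part of the original network): take $a_1 = 2a_0$, $a_2 = 3a_0$, and $E = a_1^2 + a_2^2$, so that $\mathcal{S} = \{1, 2\}$ when differentiating with respect to $a_0$:

```python
a0 = 1.5

# Intermediate variables directly influenced by a0.
a1 = 2 * a0
a2 = 3 * a0
E = a1**2 + a2**2

# Chain rule: dE/da0 = sum over k in S of (dE/da_k) * (da_k/da0).
dE_da1, dE_da2 = 2 * a1, 2 * a2
da1_da0, da2_da0 = 2, 3
chain_rule = dE_da1 * da1_da0 + dE_da2 * da2_da0

# Finite-difference check against a direct perturbation of a0.
h = 1e-6
E_perturbed = (2 * (a0 + h))**2 + (3 * (a0 + h))**2
print(chain_rule, (E_perturbed - E) / h)   # both ~= 26 * a0 = 39
```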
First, since your cost function uses the binary cross-entropy error $\mathcal{H}$ with a sigmoid activation $\sigma$, you can see that:
\begin{align}
\frac{\partial J}{\partial h_\theta}
&= \frac{1}{m}\sum_i\sum_k\frac{\partial }{\partial h_\theta}\mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{1}{m}\sum_i\sum_k \left[ \frac{-y_k^{(i)}}{h_\theta(x^{(i)})_k} + \frac{1-y_k^{(i)}}{1-h_\theta(x^{(i)})_k} \right] \\
&= \frac{1}{m}\sum_i\sum_k \frac{h_\theta(x^{(i)})_k - y_k^{(i)}}{ h_\theta(x^{(i)})_k(1-h_\theta(x^{(i)})_k) }
\end{align}
Hence, for $m=K=1$, as a commenter notes $$ \frac{\partial J}{\partial h_\theta}
= \frac{h_\theta - y}{ h_\theta(1-h_\theta) } $$
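You can confirm this expression numerically for the $m=K=1$ case (a minimal sketch; the values of $h_\theta$ and $y$ below are arbitrary):

```python
import math

def bce(y, h):
    # Binary cross-entropy H(y, h) = -[y*log(h) + (1-y)*log(1-h)].
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

y, h = 1.0, 0.7
eps = 1e-7
numeric = (bce(y, h + eps) - bce(y, h)) / eps
analytic = (h - y) / (h * (1 - h))
print(numeric, analytic)   # both ~= -1.4286
```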
But this is not so useful, as it computes how the error changes as the final output changes. What you really want is how the cost changes as the weights $\theta^{(\ell)}_{ij}$ are varied, so you can do gradient descent on them.
An intermediate calculation is to compute the variation with respect to the pre-activation $z$, where $h_\theta=\sigma(z)$.
Let the last layer be $s$. Then the output layer error is:
\begin{align}
\delta^{(s)}_j
&= \frac{\partial J}{\partial z_j^{(s)}}\\
&= \frac{1}{m}\sum_i\sum_k \frac{\partial }{\partial z_j^{(s)}} \mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{-1}{m}\sum_i\sum_k \left[ y_k^{(i)} \frac{1}{h_\theta(x^{(i)})_k}\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} - (1-y_k^{(i)})\frac{1}{1-h_\theta(x^{(i)})_k}\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} \right] \\
&= \frac{-1}{m}\sum_i\sum_k \left[ [1-h_\theta(x^{(i)})_k]\,y_k^{(i)} - h_\theta(x^{(i)})_k[1-y_k^{(i)}] \right]\\
&= \frac{1}{m}\sum_i\sum_k h_\theta(x^{(i)})_k -y_k^{(i)}
\end{align}
using the fact that, for $k = j$ (the derivative vanishes for $k \neq j$, since $\sigma$ is applied elementwise),
$$
\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}}
= \sigma'(z_j^{(s)})
= \sigma(z_j^{(s)})[1-\sigma(z_j^{(s)})]
= h_\theta(x^{(i)})_k[1-h_\theta(x^{(i)})_k]
$$
So in the case that $m=K=1$ and $s=3$, we have:
$$
\delta^{(3)} = h_\theta - y
$$
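Again, this is easy to confirm numerically by differentiating the cost through the sigmoid (a sketch for $m=K=1$, with arbitrary values for $z$ and $y$):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def J(z, y):
    # Cost as a function of the pre-activation z, with h = sigmoid(z).
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

z, y = 0.3, 1.0
eps = 1e-7
numeric = (J(z + eps, y) - J(z, y)) / eps
analytic = sigmoid(z) - y        # delta = h_theta - y
print(numeric, analytic)         # both ~= -0.4256
```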
Unfortunately, there are two common conventions in matrix calculus: Jacobian (numerator layout) and gradient (denominator layout).
For a function $f:\mathbb R^n \to \mathbb R^m$:
Jacobian convention: $\frac{\partial f}{\partial x}$ is represented by the $m\times n$ matrix with entries $\big(\tfrac{\partial f}{\partial x}\big)_{ij} = \tfrac{\partial f_i}{\partial x_j}$.
Gradient convention: $\frac{\partial f}{\partial x}$ is represented by the $n\times m$ matrix with entries $\big(\tfrac{\partial f}{\partial x}\big)_{ij} = \tfrac{\partial f_j}{\partial x_i}$.
If $f(x) = Ax$, then one can easily check that:
Under the Jacobian convention, $\frac{\partial f}{\partial x} = A$.
Under the gradient convention, $\frac{\partial f}{\partial x} = A^T$.
So it is a matter of which convention you are following.
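For instance, here is a quick NumPy check of the linear case, assembling the finite-difference Jacobian in numerator layout (the matrix $A$ and point $x$ are arbitrary choices):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # f: R^3 -> R^2, f(x) = Ax
f = lambda x: A @ x

x = np.array([1.0, -1.0, 2.0])
eps = 1e-7

# Numerator (Jacobian) layout: J[i, j] = d f_i / d x_j.
J = np.zeros((2, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    J[:, j] = (f(x + e) - f(x)) / eps

print(np.allclose(J, A, atol=1e-5))   # True: the Jacobian convention gives A;
                                      # the gradient convention stores J.T = A^T
```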