Note: I am not an expert on backprop, but having now read a bit, I think the following caveat is appropriate. When reading papers or books on neural nets, it is not uncommon for derivatives to be written using a mix of the standard summation/index notation, matrix notation, and multi-index notation (including a hybrid of the last two for tensor-tensor derivatives). Typically the intent is that this should be "understood from context", so you have to be careful!
I noticed a couple of inconsistencies in your derivation. I do not really work with neural networks, so the following may be incorrect, but here is how I would go about the problem.
First, you need to take account of the summation in $E$, and you cannot assume each term only depends on one weight. So taking the gradient of $E$ with respect to component $k$ of $z$, we have
$$E=-\sum_jt_j\log o_j\implies\frac{\partial E}{\partial z_k}=-\sum_jt_j\frac{\partial \log o_j}{\partial z_k}$$
Then, expressing $o_j$ as
$$o_j=\tfrac{1}{\Omega}e^{z_j} \,,\, \Omega=\sum_ie^{z_i} \implies \log o_j=z_j-\log\Omega$$
we have
$$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-\frac{1}{\Omega}\frac{\partial\Omega}{\partial z_k}$$
where $\delta_{jk}$ is the Kronecker delta. Then the gradient of the softmax-denominator is
$$\frac{\partial\Omega}{\partial z_k}=\sum_ie^{z_i}\delta_{ik}=e^{z_k}$$
which gives
$$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-o_k$$
or, multiplying through by $o_j$ (since $\frac{\partial o_j}{\partial z_k}=o_j\frac{\partial \log o_j}{\partial z_k}$),
$$\frac{\partial o_j}{\partial z_k}=o_j(\delta_{jk}-o_k)$$
Note that the derivative is with respect to $z_k$, an arbitrary component of $z$, which gives the $\delta_{jk}$ term ($=1$ only when $k=j$).
So the gradient of $E$ with respect to $z$ is then
$$\frac{\partial E}{\partial z_k}=\sum_jt_j(o_k-\delta_{jk})=o_k\left(\sum_jt_j\right)-t_k \implies \frac{\partial E}{\partial z_k}=o_k\tau-t_k$$
where $\tau=\sum_jt_j$ is constant (for a given $t$ vector).
This shows a first difference from your result: the $t_k$ no longer multiplies $o_k$. Note that for the typical case where $t$ is "one-hot" we have $\tau=1$ (as noted in your first link).
A second inconsistency, if I understand your setup correctly, is that the "$o$" that is input to $z$ is unlikely to be the same "$o$" that is output from the softmax; it makes more sense for that input to come from "further back" in the network architecture.
Calling this vector $y$, we then have
$$z_k=\sum_iw_{ik}y_i+b_k \implies \frac{\partial z_k}{\partial w_{pq}}=\sum_iy_i\frac{\partial w_{ik}}{\partial w_{pq}}=\sum_iy_i\delta_{ip}\delta_{kq}=\delta_{kq}y_p$$
Finally, to get the gradient of $E$ with respect to the weight-matrix $w$, we use the chain rule
$$\frac{\partial E}{\partial w_{pq}}=\sum_k\frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}=\sum_k(o_k\tau-t_k)\delta_{kq}y_p=y_p(o_q\tau-t_q)$$
giving the final expression (assuming a one-hot $t$, i.e. $\tau=1$)
$$\frac{\partial E}{\partial w_{ij}}=y_i(o_j-t_j)$$
where $y$ is the input on the lowest level (of your example).
So this shows a second difference from your result: the "$o_i$" should presumably be from the level below $z$, which I call $y$, rather than the level above $z$ (which is $o$).
Hopefully this helps. Does this result seem more consistent?
Update: In response to a query from the OP in the comments, here is an expansion of the first step.
First, note that the vector chain rule requires summations (see here). Second, to be certain of getting all gradient components, you should always introduce a new subscript letter for the component in the denominator of the partial derivative. So to fully write out the gradient with the full chain rule, we have
$$\frac{\partial E}{\partial w_{pq}}=\sum_i \frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial w_{pq}}$$
and
$$\frac{\partial o_i}{\partial w_{pq}}=\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}$$
so
$$\frac{\partial E}{\partial w_{pq}}=\sum_i \left[ \frac{\partial E}{\partial o_i}\left(\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}\right) \right]$$
In practice the full summations reduce, because you get a lot of $\delta_{ab}$ terms. Although it involves a lot of perhaps "extra" summations and subscripts, using the full chain rule will ensure you always get the correct result.
The final matrix is already a matrix of derivatives $\frac{\partial y}{\partial z}$. Every element $i, j$ of the matrix corresponds to a single derivative of the form $\frac{\partial y_i}{\partial z_j}$.
One usually expects to compute gradients for the backpropagation algorithm, but those can be computed only for scalars. In this case $y$ is a vector, so we stack the gradients of its components (a single component of the $y$ vector is a scalar), forming a Jacobian matrix:
$$
\frac{\partial y}{\partial z} =
\begin{bmatrix}
\frac{\partial y_1}{\partial z} \\
\frac{\partial y_2}{\partial z} \\
\frac{\partial y_3}{\partial z}
\end{bmatrix} =
\begin{bmatrix}
\frac{\partial y_1}{\partial z_1} & \frac{\partial y_1}{\partial z_2} & \frac{\partial y_1}{\partial z_3} \\
\frac{\partial y_2}{\partial z_1} & \frac{\partial y_2}{\partial z_2} & \frac{\partial y_2}{\partial z_3} \\
\frac{\partial y_3}{\partial z_1} & \frac{\partial y_3}{\partial z_2} & \frac{\partial y_3}{\partial z_3}
\end{bmatrix} =
\begin{bmatrix}y_1(1-y_1) & -y_1y_2 & -y_1y_3 \\-y_1y_2 & y_2(1-y_2) & -y_2y_3\\-y_1y_3 &-y_2y_3 & y_3(1-y_3)\end{bmatrix}
$$
Note however that in backpropagation we would usually compute the gradient of some loss function $L$. Since it is a scalar, we can compute its gradient w.r.t. $z$:
$$
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial z}
$$
The component $\frac{\partial L}{\partial y}$ is a gradient (i.e. a vector) which should have been computed in the previous step of the backpropagation and depends on the actual loss function (e.g. cross-entropy or MSE). The second component is the matrix shown above. Multiplying the vector $\frac{\partial L}{\partial y}$ by the matrix $\frac{\partial y}{\partial z}$ gives another vector $\frac{\partial L}{\partial z}$ which is suitable for the next backpropagation step.
Note that the formula for $\frac{\partial L}{\partial z}$ might be a little difficult to derive directly in the vectorized form shown above. It is easier to start by computing the derivatives for single elements and to vectorize the equation later:
$$
\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial z_j}
$$
The above equation for a single element of $\frac{\partial L}{\partial z}$ is equivalent to the vectorized form above; one can convince oneself of this by comparing it with the way vector-matrix multiplication works.
Best Answer
The last hidden layer produces output values forming a vector $\vec x = \mathbf x$. The output neuronal layer is meant to classify among $K$ categories ($k=1,\dots,K$) with a SoftMax activation function assigning conditional probabilities (given $\mathbf x$) to each of the $K$ categories. In each node of the final (or output) layer, the pre-activated values (logit values) will consist of the scalar products $\mathbf{w}_j^\top\mathbf{x}$, where $\mathbf w_j\in\{\mathbf{w}_1, \mathbf{w}_2,\dots,\mathbf{w}_K\}$. In other words, each category $k$ will have a different vector of weights pointing at it, determining the contribution of each element of the output of the previous layer (including a bias), encapsulated in $\mathbf x$. However, the activation of this final layer will not take place element-wise (as, for example, with a sigmoid function in each neuron), but rather through the application of a SoftMax function, which maps a vector in $\mathbb R^K$ to a vector of $K$ elements in $[0,1]$. [Figure: a made-up NN to classify colors.]
Defining the softmax as
$$ \sigma(j)=\frac{\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)}=\frac{\exp(z_j)}{\sum_{k=1}^K \exp(z_k)}$$
We want to get the partial derivative with respect to a vector of weights $\mathbf w_i$, but we can first get the derivative of $\sigma(j)$ with respect to the logit, i.e. $z_i = \mathbf w_i^\top \mathbf x$:
$$\begin{align} \frac{\partial}{\partial( \mathbf{w}_i^\top \mathbf x)}\sigma(j) &= \frac{\partial}{\partial \left(\mathbf{w}_i^\top \mathbf x\right)}\;\frac{\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)} \\[2ex] &\underset{*}{=} \frac{\frac{\partial}{\partial (\mathbf{w}_i^\top \mathbf x)}\,\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)}\,-\,\frac{\exp(\mathbf w_j^\top \mathbf x)}{\left(\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x) \right)^2}\;\frac{\partial}{\partial \left(\mathbf w_i^\top\mathbf x\right)}\,\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)\\[2ex] &= \frac{\delta_{ij}\exp(\mathbf{w}_j^\top \mathbf x)}{\sum_{k=1}^K \exp(\mathbf{w}_k^\top\mathbf x)}\,-\,\frac{\exp(\mathbf w_j^\top \mathbf x)}{ \sum_{k=1}^K \exp\left(\mathbf{w}_k^\top\mathbf x \right)}\, \frac{\exp(\mathbf{w}_i^\top\mathbf x)}{\sum_{k=1}^K \exp\left(\mathbf{w}_k^\top\mathbf x \right)} \\[3ex] &=\sigma(j)\left(\delta_{ij}-\sigma(i)\right) \end{align}$$
$* \text{- quotient rule}$
Thanks and (+1) to Yuntai Kyong for pointing out that there was a forgotten index in the prior version of the post, and that the changes in the denominator of the softmax had been left out of the following chain rule.
By the chain rule,
$$\begin{align}\frac{\partial}{\partial \mathbf{w}_i}\sigma(j)&= \sum_{k = 1}^K \frac{\partial}{\partial (\mathbf{w}_k^\top \mathbf x)}\sigma(j)\quad \frac{\partial}{\partial\mathbf{w}_i}\mathbf{w}_k^\top \mathbf{x}\\[2ex] &=\sum_{k = 1}^K \frac{\partial}{\partial (\mathbf{w}_k^\top \mathbf x)}\;\sigma(j)\quad \delta_{ik} \mathbf{x}\\[2ex] &=\sum_{k = 1}^K\sigma(j)\left(\delta_{kj}-\sigma(k)\right)\quad \delta_{ik} \mathbf{x} \end{align}$$
Carrying out the sum over $k$ (only the $k=i$ term survives, thanks to $\delta_{ik}$):
$$\bbox[8px, border: 2px solid lime]{\frac{\partial}{\partial \mathbf{w}_i}\sigma(j)=\sigma(j)\left(\delta_{ij}-\sigma(i)\right)\mathbf x}$$