I feel a little bit bad about providing my own answer for this because it is pretty well captured by amoeba and juampa, except for maybe the final intuition about how the Jacobian can be reduced back to a vector.
You correctly derived the gradient of the diagonal of the Jacobian matrix, which is to say that
$ {\partial h_i \over \partial z_j}= h_i(1-h_j)\;\;\;\;\;\;: i = j $
and as amoeba stated it, you also have to derive the off diagonal entries of the Jacobian, which yield
$ {\partial h_i \over \partial z_j}= -h_ih_j\;\;\;\;\;\;: i \ne j $
These two concepts definitions can be conveniently combined using a construct called the Kronecker Delta, so the definition of the gradient becomes
$ {\partial h_i \over \partial z_j}= h_i(\delta_{ij}-h_j) $
So the Jacobian is a square matrix $ \left[J \right]_{ij}=h_i(\delta_{ij}-h_j) $
All of the information up to this point is already covered by amoeba and juampa. The problem is of course, that we need to get the input errors from the output errors that are already computed. Since the gradient of the output error $\nabla h_i$ depends on all of the inputs, then the gradient of the input $x_i$ is
$[\nabla x]_k = \sum\limits_{i=1} \nabla h_{i,k} $
Given the Jacobian matrix defined above, this is implemented trivially as the product of the matrix and the output error vector:
$ \vec{\sigma_l} = J\vec{\sigma_{l+1}} $
If the softmax layer is your output layer, then combining it with the cross-entropy cost model simplifies the computation to simply
$ \vec{\sigma_l} = \vec{h}-\vec{t} $
where $\vec{t}$ is the vector of labels, and $\vec{h}$ is the output from the softmax function. Not only is the simplified form convenient, it is also extremely useful from a numerical stability standpoint.
Note: I am not an expert on backprop, but now having read a bit, I think the following caveat is appropriate. When reading papers or books on neural nets, it is not uncommon for derivatives to be written using a mix of the standard summation/index notation, matrix notation, and multi-index notation (include a hybrid of the last two for tensor-tensor derivatives). Typically the intent is that this should be "understood from context", so you have to be careful!
I noticed a couple of inconsistencies in your derivation. I do not do neural networks really, so the following may be incorrect. However, here is how I would go about the problem.
First, you need to take account of the summation in $E$, and you cannot assume each term only depends on one weight. So taking the gradient of $E$ with respect to component $k$ of $z$, we have
$$E=-\sum_jt_j\log o_j\implies\frac{\partial E}{\partial z_k}=-\sum_jt_j\frac{\partial \log o_j}{\partial z_k}$$
Then, expressing $o_j$ as
$$o_j=\tfrac{1}{\Omega}e^{z_j} \,,\, \Omega=\sum_ie^{z_i} \implies \log o_j=z_j-\log\Omega$$
we have
$$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-\frac{1}{\Omega}\frac{\partial\Omega}{\partial z_k}$$
where $\delta_{jk}$ is the Kronecker delta. Then the gradient of the softmax-denominator is
$$\frac{\partial\Omega}{\partial z_k}=\sum_ie^{z_i}\delta_{ik}=e^{z_k}$$
which gives
$$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-o_k$$
or, expanding the log
$$\frac{\partial o_j}{\partial z_k}=o_j(\delta_{jk}-o_k)$$
Note that the derivative is with respect to $z_k$, an arbitrary component of $z$, which gives the $\delta_{jk}$ term ($=1$ only when $k=j$).
So the gradient of $E$ with respect to $z$ is then
$$\frac{\partial E}{\partial z_k}=\sum_jt_j(o_k-\delta_{jk})=o_k\left(\sum_jt_j\right)-t_k \implies \frac{\partial E}{\partial z_k}=o_k\tau-t_k$$
where $\tau=\sum_jt_j$ is constant (for a given $t$ vector).
This shows a first difference from your result: the $t_k$ no longer multiplies $o_k$. Note that for the typical case where $t$ is "one-hot" we have $\tau=1$ (as noted in your first link).
A second inconsistency, if I understand correctly, is that the "$o$" that is input to $z$ seems unlikely to be the "$o$" that is output from the softmax. I would think that it makes more sense that this is actually "further back" in network architecture?
Calling this vector $y$, we then have
$$z_k=\sum_iw_{ik}y_i+b_k \implies \frac{\partial z_k}{\partial w_{pq}}=\sum_iy_i\frac{\partial w_{ik}}{\partial w_{pq}}=\sum_iy_i\delta_{ip}\delta_{kq}=\delta_{kq}y_p$$
Finally, to get the gradient of $E$ with respect to the weight-matrix $w$, we use the chain rule
$$\frac{\partial E}{\partial w_{pq}}=\sum_k\frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}=\sum_k(o_k\tau-t_k)\delta_{kq}y_p=y_p(o_q\tau-t_q)$$
giving the final expression (assuming a one-hot $t$, i.e. $\tau=1$)
$$\frac{\partial E}{\partial w_{ij}}=y_i(o_j-t_j)$$
where $y$ is the input on the lowest level (of your example).
So this shows a second difference from your result: the "$o_i$" should presumably be from the level below $z$, which I call $y$, rather than the level above $z$ (which is $o$).
Hopefully this helps. Does this result seem more consistent?
Update: In response to a query from the OP in the comments, here is an expansion of the first step.
First, note that the vector chain rule requires summations (see here). Second, to be certain of getting all gradient components, you should always introduce a new subscript letter for the component in the denominator of the partial derivative. So to fully write out the gradient with the full chain rule, we have
$$\frac{\partial E}{\partial w_{pq}}=\sum_i \frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial w_{pq}}$$
and
$$\frac{\partial o_i}{\partial w_{pq}}=\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}$$
so
$$\frac{\partial E}{\partial w_{pq}}=\sum_i \left[ \frac{\partial E}{\partial o_i}\left(\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}\right) \right]$$
In practice the full summations reduce, because you get a lot of $\delta_{ab}$ terms. Although it involves a lot of perhaps "extra" summations and subscripts, using the full chain rule will ensure you always get the correct result.
Best Answer
I have not read the book you mentioned. But it seems you probably missed the point of backpropagation. Yes, you only need to measure the cost of the output usually, but in order to train your model (i.e., get the desirable weights and biases), you have to compute the gradients of the cost w.r.t. all the weights and biases. That is why you have the last equations. Now for example, you get: $$ {\partial C \over \partial b_j^L}={\partial C \over \partial z_j^L}{\partial z_j^L \over \partial b_j^L}=(\sum_k{\partial C \over \partial a_k^L}{\partial a_k^L \over \partial z_j^L}){\partial z_j^L \over \partial b_j^L} $$ where you know all the functions of $C(a), a(z)$, and $z(b)$. And you have, $$ {\partial C \over \partial a_k^L} = {\partial{-\ln{a_y^L}} \over \partial a_k^L} = \begin{cases} - {\frac 1 {a_y^L}}, &k=y \\ 0, & k!=y \end{cases} $$
$$ {\partial a_k^L \over \partial z_j^L} = \begin{cases} a_j^L(1-a_j^L), &j=k \\ -a_k^L a_j^L,&j!=k \end{cases} $$
$$ {\partial z_j^L \over \partial b_j^L} = 1 $$
Then you have, $$ {\partial C \over \partial b_j^L} = \begin{cases} a_j^L-1, &j=y \\ a_j^L,&j!=y \end{cases}, \space $$ i.e., $$ {\partial C \over \partial b_j^L} = a_j^L - 1_{j = y} = a_j^L - y_j $$