"When we apply the transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of this as moving the error backward through the network"
Is that the sentence you're confused about? Note that in this sentence he's not trying to provide intuition for ''applying the transpose operator'', he's trying to provide intuition for ''applying the transpose weight matrix'', which basically means ''multiplying by the transpose weight matrix''. That entire act of multiplying the error vector by the transposed weight matrix is what can be understood intuitively as moving the error backward through the network, not just the transpose operator on its own.
The reason for doing the transpose is simply to make the dimensions work out: the matrix needs to be transposed to guarantee that multiplying it by the error vector results in a vector of the correct dimensionality.
The reason why this application (~= multiplication) with the transposed weight matrix can be understood as moving the error backward through the network is essentially the same intuition you probably already have for the forwards pass. In the forwards pass, a high/strong weight means a node in one layer has a large influence on the connected node in the next layer. Exactly the same intuition holds for the backwards pass, because, just like in the forwards pass, we're multiplying ''something'' by a weight. In the forwards pass, that ''something'' is the activation level of a node in the previous layer. In the backwards pass, that ''something'' is the error observed in a node in the later layer. If a particular connection is ''strong'' (the weight has a high value), and we also have a large error, we should put a large amount of ''blame'' for the error on that large weight. This is exactly what we get by multiplying by the (transposed) weight matrix: large weights take a large portion of the ''blame'' for large errors.
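If it helps to see this in code, here is a tiny NumPy sketch of that ''blame'' intuition. The layer sizes, weights and errors are made up purely for illustration, and the activation-derivative factor from the full backprop equation is ignored, just like in the text above:

```python
import numpy as np

# Made-up weight matrix w^{l+1}: 2 nodes in layer l+1, 3 nodes in layer l.
# Row i holds the weights feeding into node i of layer l+1.
w_next = np.array([[5.0, 0.1, 0.1],
                   [0.1, 0.1, 0.1]])

# Error observed at the 2 nodes of layer l+1.
delta_next = np.array([1.0, 1.0])

# "Moving the error backward": multiply by the transposed weight matrix.
delta_back = w_next.T @ delta_next
print(delta_back)  # [5.1 0.2 0.2] -> the node behind the strong 5.0 weight gets most of the blame
```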
A more elaborate explanation of the transpose:
Let's consider the forwards pass, following the notation from your link. For simplicity, let's ignore activation functions and biases for now; they're not really important for the intuition. Then, equation 25 in your link tells us that the activation vector $a^l$ of layer $l$ is defined as $a^l = w^l a^{l-1}$. To make the notation a bit more consistent with the equation for the backwards pass, we'll rewrite this as $a^{l+1} = w^{l+1} a^l$ (simply add $1$ to all layer indices).
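In code, this simplified forwards pass is just a matrix-vector product. A minimal NumPy sketch, with arbitrary layer sizes:

```python
import numpy as np

n_l, n_next = 3, 2                      # made-up sizes: layer l has 3 nodes, layer l+1 has 2
w_next = np.random.randn(n_next, n_l)   # w^{l+1}: one row per node of layer l+1
a_l = np.random.randn(n_l)              # activation vector a^l

a_next = w_next @ a_l                   # a^{l+1} = w^{l+1} a^l (biases/activation function omitted)
print(a_next.shape)                     # (2,)
```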
$w^{l+1}$ is a matrix, and the $a$'s are vectors, so it's convenient to consider again what a matrix-vector multiplication looks like. Take a look, for example, at the image near the top of http://mathinsight.org/matrix_vector_multiplication (copied at the bottom of this answer).
Let's consider the activation level of the very first node (at the top) of layer $l+1$. This would be the top element of the vector on the right-hand side of the equation on the page I linked. This one activation level is determined by the complete vector of activation levels of the entire previous layer, and the very first row of weights in the weight matrix.
Now suppose we have a vector $\delta^{l+1}$ of errors in layer $l+1$. Let's again consider only the top node of this layer. The activation level of this particular node was determined by the top row of the weight matrix $w^{l+1}$ and the entire vector of activation levels $a^l$. We'll want to "punish" each of the weights in that top row in proportion to the magnitude of that weight and of our error. So, to do that, we'll want some kind of multiplication between $w^{l+1}$ and $\delta^{l+1}$ (note that this is different from the forwards pass, in which we multiplied the same weight matrix by a vector from a different layer: the activation vector $a^l$). This should again result in a vector with the same shape as $a^l$ (this is already one clue that we need the transpose; otherwise the shape simply won't be correct).
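The shape argument alone already forces the transpose; a quick NumPy check with arbitrary layer sizes:

```python
import numpy as np

n_l, n_next = 4, 3
w_next = np.random.randn(n_next, n_l)    # w^{l+1} maps layer l (4 nodes) to layer l+1 (3 nodes)
delta_next = np.random.randn(n_next)     # error vector delta^{l+1}, shape (3,)

# w_next @ delta_next would raise an error: shapes (3, 4) and (3,) don't line up.
back = w_next.T @ delta_next             # (4, 3) @ (3,) -> (4,)
print(back.shape)                        # (4,) -- the same shape as a^l, as required
```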
To figure out exactly what this multiplication should look like, we have to investigate which weights are to blame for which errors. The top row of weights in the matrix $w^{l+1}$ influenced (during the forwards pass) only the top element of $a^{l+1}$, and is therefore now only to blame for the top element of the error vector $\delta^{l+1}$; that entire top row of weights should therefore also be multiplied only by the top element of $\delta^{l+1}$ when computing its part of $\delta^l$. In the notation of the image near the top of the page I linked, this means we would like the top row of weights to be multiplied only by $x_1$. But that's not what the picture of matrix-vector multiplication says; in that picture, the elements of the top row of the matrix are multiplied by all of the $x$'s. If you look carefully, though, there is a part of the matrix that is multiplied solely by the top element $x_1$: the very first column. By transposing the weight matrix, we interchange rows and columns, and we get exactly the multiplications we want.
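That row/column argument can also be checked numerically: row $i$ of $w^{l+1}$ should only ever be multiplied by element $i$ of $\delta^{l+1}$, and accumulating those contributions per node of layer $l$ is exactly what the product with the transposed matrix computes. A small sketch with random numbers:

```python
import numpy as np

w_next = np.random.randn(3, 4)       # rows = nodes of layer l+1, columns = nodes of layer l
delta_next = np.random.randn(3)      # error vector delta^{l+1}

# Blame computed row by row: row i of the weight matrix is multiplied only by delta_next[i],
# and the contributions are accumulated per node of layer l.
blame = np.zeros(4)
for i in range(3):
    blame += w_next[i, :] * delta_next[i]

# The transposed matrix-vector product does exactly the same thing.
print(np.allclose(blame, w_next.T @ delta_next))   # True
```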
Matrix-vector multiplication copied from the linked page:
$$
A\mathbf{x} =
\left[
\begin{array}{cccc}
a_{11} & a_{12} & \ldots & a_{1n}\\
a_{21} & a_{22} & \ldots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \ldots & a_{mn}
\end{array}
\right]
\left[
\begin{array}{c}
x_1\\
x_2\\
\vdots\\
x_n
\end{array}
\right]
=
\left[
\begin{array}{c}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n\\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n\\
\vdots\\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n
\end{array}
\right]
$$
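As a concrete instance of the formula above (the numbers are arbitrary, just to make the pattern visible):

$$
\left[
\begin{array}{cc}
1 & 2\\
3 & 4
\end{array}
\right]
\left[
\begin{array}{c}
5\\
6
\end{array}
\right]
=
\left[
\begin{array}{c}
1 \cdot 5 + 2 \cdot 6\\
3 \cdot 5 + 4 \cdot 6
\end{array}
\right]
=
\left[
\begin{array}{c}
17\\
39
\end{array}
\right]
$$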
Best Answer
Your calculation is correct. By the way, there are free software packages available that make it easier to do these kinds of calculations. For example, using Octave, you simply type
and it will give you the answer: