Keep in mind why $\Theta_3$ is $4 \times 6$ rather than $4 \times 5$, even though the third layer has only $5$ nodes. It's because each node in the output layer takes the $5$ nodes as input plus an intercept. Remember that $\delta^{(3)}$ is the derivative of the error function with respect to each node in the third layer, prior to activation. One of your six $\delta^{(3)}$ components is the derivative with respect to the intercept, which has no dependence on any earlier part of the network, and thus has no further "backpropagating" to do. It's not even a relevant value to the calculation, because all you want is the derivative with respect to the weights that travel from the intercept to the outputs.
(I know it doesn't make sense to take a derivative with respect to a constant. However, what we're doing is treating the intercept as if it were an extra variable that always happens to have an observed value of 1. It's done that way for convenience, so we can place its weights in the same matrix as the other weights, rather than handling it separately.)
Thus in the second calculation you matrix-multiply $\Theta_2^T$ with the five $\delta^{(3)}$ components that you care about: the ones corresponding to the actual nodes that take weighted inputs from earlier in the network.
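To make the shapes concrete, here's a minimal NumPy sketch of that step, using the hypothetical dimensions from the question (5 nodes plus an intercept in layer 3, 4 output nodes). The elementwise activation-derivative factor is omitted so the focus stays on which $\delta^{(3)}$ components propagate back:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes matching the question: layer 3 has 5 nodes plus an
# intercept, and the output layer has 4 nodes, so Theta3 is 4 x 6.
Theta3 = rng.standard_normal((4, 6))
# Theta2 maps layer 2 (5 nodes + intercept) to layer 3's 5 real nodes.
Theta2 = rng.standard_normal((5, 6))

delta4 = rng.standard_normal(4)  # stand-in for the output-layer error

# delta3 gets 6 components, one per column of Theta3; delta3[0] belongs
# to the intercept and never propagates further back.
delta3 = Theta3.T @ delta4

# Only the 5 components for actual nodes are used in the next step.
# (The elementwise g'(z) factor is omitted to focus on the shapes.)
delta2 = Theta2.T @ delta3[1:]

print(delta3.shape, delta2.shape)  # (6,) (6,)
```

Slicing off `delta3[0]` is exactly the "drop the intercept component" step described above.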
Let me try to tackle those questions one by one.
Before we start, let's ignore the regularization term $\lambda\Theta^{l}_{ij}$ for now. The concepts are easier to understand without it, and you can tackle it once the rest feels clear.
Let's also pretend that bias terms don't exist.
Dimensions of $\Delta^{l}$: $\Delta^{l}$ is a matrix, and its dimensions (assuming a fully connected neural net, which is what I think the tutorial is covering) are: number of rows = number of nodes in the next layer, number of columns = number of nodes in the previous layer. Exceptions: for the input layer, the number of columns equals the number of input features, and for the output layer, the number of rows equals the number of output features. There's a confusing repetition of the letter $i$ in the slide: it's used both to iterate through examples $1$ to $m$ and as an index of the $\Delta$ matrix/matrices. (Note you will sometimes see this matrix defined with the rows and columns swapped, i.e. the transpose. However, your reference material doesn't seem to do that.)
What would this look like for a 3-layered NN: I tend to think of it as 2 separate matrices, $\Delta^{0}$ and $\Delta^{1}$ (I chose to index from 0, but you could index from 1). For a 3x3x3 NN, $\Delta^{0}$ would be 3x3 and $\Delta^{1}$ would be 3x3. For a 3x3x1 NN, $\Delta^{0}$ would still be 3x3 but $\Delta^{1}$ would be 1x3, assuming the input is a column vector.
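A quick sketch of those shapes, using the 3x3x1 example above (bias terms still ignored, as assumed at the start):

```python
import numpy as np

# Layer sizes for the 3x3x1 example (bias terms still ignored):
sizes = [3, 3, 1]

# One accumulator matrix per layer transition, initialized to zero.
# Deltas[l] has shape (nodes in next layer, nodes in previous layer).
Deltas = [np.zeros((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]

print([D.shape for D in Deltas])  # [(3, 3), (1, 3)]
```

The `np.zeros` initialization is exactly the "set $\Delta$ to all 0 at the start" step discussed next.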
Why the $\Delta$ is set to all 0 at the start: It's just to initialize. You haven't started calculating or "collecting the terms" to calculate the gradient yet, so you initialize to 0 before you start.
Significance of the updating: Back to the confusing repeated use of $i$. We pass every data point through the neural net, one per iteration of the loop going from $1$ to $m$. So on the first run through the loop, we only accumulate what we think the gradient is based on data point 1, $x^{(1)}$. But who bets the farm on 1 data point? So the next time through, we add the contribution of $x^{(2)}$... and so on until we get to $x^{(m)}$ and exhaust our data.
But why is this $\Delta$ (after all the calculation) the gradient of the cost function with respect to the parameters?: Well, you're taking the gradient of the error that each sample/data point contributes with respect to the parameters (each data point = one iteration through the for loop). All you're doing by adding them up (and later dividing by $m$) is essentially averaging them to get a better estimate of the gradient.
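The accumulate-then-average pattern can be sketched on a toy "network" with a single weight matrix and an identity activation (a simplifying assumption so the forward pass is one line; the real NN just has more layers):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: m examples through a single weight matrix Theta,
# identity activation, squared error -- just to show the accumulation.
m, n_in, n_out = 4, 3, 2
X = rng.standard_normal((m, n_in))
Y = rng.standard_normal((m, n_out))
Theta = rng.standard_normal((n_out, n_in))

Delta = np.zeros_like(Theta)        # initialize the accumulator to zero
for i in range(m):
    a = X[i]                        # "forward pass" for example i
    delta = Theta @ a - Y[i]        # error term for this one example
    Delta += np.outer(delta, a)     # accumulate this example's gradient
grad = Delta / m                    # average over all m examples
```

Each `np.outer(delta, a)` is the $a^{(l)}_j\delta^{(l+1)}_i$ product from the slide, laid out as a matrix the same shape as `Theta`.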
OK, but how are we deciding that adding up $a^{(l)}_j\delta^{(l+1)}_{i}$ turns into a gradient of $J(\Theta)$ (after dividing by $m$): This one is tough to type up. To really understand it, I recommend penciling out a baby NN and working through it (doable if you have some, even rusty, calculus background). However, at this stage in the slides, I don't think you're expected to do that. The activation function isn't even given in the slide, and you need it to actually do the derivation. You should be able to google for exercises others have blogged. For example, https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ looks promising, though, full disclosure: I only leafed through it quickly.
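Short of penciling out the derivation, one way to convince yourself is a numerical gradient check: compare the accumulated $\Delta/m$ against finite differences of the cost. A minimal sketch, again on a toy single-layer setup with squared error (my simplifying assumptions, not the course's exact cost):

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny single-layer setup with squared-error cost, to sanity-check
# that the accumulated Delta / m matches a numerical gradient.
m, n_in, n_out = 5, 3, 2
X = rng.standard_normal((m, n_in))
Y = rng.standard_normal((m, n_out))
Theta = rng.standard_normal((n_out, n_in))

def cost(T):
    return 0.5 * np.sum((X @ T.T - Y) ** 2) / m

# "Analytic" gradient via the accumulation loop from the slide
Delta = np.zeros_like(Theta)
for i in range(m):
    delta = Theta @ X[i] - Y[i]
    Delta += np.outer(delta, X[i])
grad = Delta / m

# Numerical gradient by central differences, one parameter at a time
eps = 1e-6
num = np.zeros_like(Theta)
for idx in np.ndindex(*Theta.shape):
    Tp, Tm = Theta.copy(), Theta.copy()
    Tp[idx] += eps
    Tm[idx] -= eps
    num[idx] = (cost(Tp) - cost(Tm)) / (2 * eps)

print(np.max(np.abs(grad - num)))  # should be tiny
```

This is the same gradient-checking idea the course itself uses to debug backprop implementations.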
In summary
So after all this work, you have now done backprop once, and have the gradient of the cost function with respect to the various parameters stored in $\Delta^{0}$ through $\Delta^{L-1}$ for an $L$-layered fully connected NN.
Also, I did need to refer to https://www.coursera.org/learn/machine-learning/supplement/pjdBA/backpropagation-algorithm to answer. Note they also assume a specific activation function, and get into details on later slides.
Finally, I made an assumption at the start that bias terms don't exist, because the dimensions are easier to see that way. You'll need to expand the matrices between each layer to accommodate the bias term as well, which is the more usual construction.
Best Answer
The recursive formula shared on Wikipedia is $$\delta^{l-1}=(f^{l-1})'\circ (W^l)^T \delta^l$$ This lets you propagate the derivatives back to the first layers. However, the initial condition for this recursion is at $l=L$ ($L$ is the number of layers), and $\delta^L$ must be calculated separately. If you look a bit further down the page, you'll see this initial condition given as $$\delta^L=(f^L)'\circ \nabla_{a^L}C$$ where $a^L$ is the activated output of layer $L$ and $C$ is the cost function, according to the definitions on the page. The last derivative term translates into $(t_k-z_k)$ in your equation (the sign differs, but that's probably because the tutorial you mentioned adds the gradient instead of subtracting it). So it is almost the same equation.
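The initial condition plus recursion can be sketched in a few lines. A toy 2-4-3 network with sigmoid activations and squared-error cost $C=\tfrac{1}{2}\|a^L-t\|^2$ (my assumptions for the sketch; biases omitted), so $\nabla_{a^L}C = a^L - t = -(t - a^L)$, matching the sign note above. A finite-difference check on one weight confirms the resulting gradients:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fprime(z):
    return sigmoid(z) * (1 - sigmoid(z))

# Toy 2-4-3 network in the Wikipedia notation: W[l] maps layer l to l+1.
sizes = [2, 4, 3]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
x = rng.standard_normal(2)
t = rng.standard_normal(3)  # target

# Forward pass, keeping pre-activations z and activations a
a, zs = [x], []
for Wl in W:
    zs.append(Wl @ a[-1])
    a.append(sigmoid(zs[-1]))

# Initial condition: delta^L = (f^L)' o grad_{a^L} C, with C = 0.5*||a^L - t||^2
delta = fprime(zs[-1]) * (a[-1] - t)

# Recursion: delta^{l-1} = (f^{l-1})' o (W^l)^T delta^l
deltas = [delta]
for l in range(len(W) - 1, 0, -1):
    delta = fprime(zs[l - 1]) * (W[l].T @ delta)
    deltas.insert(0, delta)

# Weight gradients then follow as dC/dW[l] = outer(delta^{l+1}, a^l)
grads = [np.outer(deltas[l], a[l]) for l in range(len(W))]

# Quick finite-difference check on one entry of W[0]
def cost(W0):
    out = sigmoid(W[1] @ sigmoid(W0 @ x))
    return 0.5 * np.sum((out - t) ** 2)

eps = 1e-6
Wp, Wm = W[0].copy(), W[0].copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (cost(Wp) - cost(Wm)) / (2 * eps)
print(abs(numeric - grads[0][0, 0]))  # should be tiny
```

The separate line computing `delta` before the loop is the "initial condition calculated separately" point: the recursion alone cannot start without it.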