Backprop is used to compute the gradient of the loss function--that is, a vector containing the partial derivative of the loss function with respect to each parameter of the network. Mechanically, backprop is just a systematic application of the chain rule from calculus.
The gradient is used to update the weights according to some learning rule, whose job is to reduce the value of the loss function. Many learning rules are possible, but one of the simplest and most widely used is gradient descent. The gradient at each point in parameter space is a vector that points in the direction in which the loss function increases most steeply. At each iteration, gradient descent takes a step in the direction opposite the gradient--that is, it steps in the direction of steepest descent, thereby reducing the loss function.
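To make the update rule concrete, here's a minimal NumPy sketch of gradient descent on a toy loss whose gradient we can write by hand (the loss, learning rate, and starting point are all made up for illustration):

```python
import numpy as np

# Toy loss: f(w) = ||w||^2, whose gradient is 2w (minimum at w = 0).
def loss(w):
    return np.sum(w ** 2)

def gradient(w):
    return 2 * w

w = np.array([3.0, -4.0])  # initial parameters
lr = 0.1                   # learning rate (step size)

for _ in range(100):
    w = w - lr * gradient(w)  # step opposite the gradient

# After enough steps, w is driven toward the minimum at 0,
# so loss(w) is driven toward its minimum value of 0.
```

In a real network the hand-written `gradient` function is replaced by the gradient that backprop computes, but the update step is exactly this line.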
Let me try to tackle those questions one by one.
Before we start, let's ignore the regularization term $\lambda\Theta^{l}_{ij}$ for now. The concepts are easier to understand without it, and you can add it back after the rest feels clear.
Let's also pretend that bias terms don't exist.
Dimensions of $\Delta^{l}$: $\Delta^{l}$ is a matrix. Assuming a fully connected neural net (which is what I think the tutorial is covering), its dimensions are: $nrows$ = number of nodes in the next layer (layer $l+1$) and $ncolumns$ = number of nodes in the previous layer (layer $l$). So for the first $\Delta$, #columns = #input features, and for the last one, #rows = #output features. There's a confusing repetition of the letter $i$ in the slide - it's used both to iterate through examples $1$ to $m$ and as an index into the $\Delta$ matrix/matrices. (Note you will sometimes see this matrix defined with $nrows$ and $ncolumns$ swapped, i.e. the transpose. However, your reference material doesn't seem to do that.)
What would this look like for a 3-layer NN: I tend to think of it as 2 separate matrices, $\Delta^{0}$ and $\Delta^{1}$. For a 3x3x3 NN, $\Delta^{0}$ would be 3x3 and $\Delta^{1}$ would be 3x3. For a 3x3x1 NN, $\Delta^{0}$ would still be 3x3 but $\Delta^{1}$ would be 1x3 (I chose to index from 0, but you could index from 1). This assumes the input is a column vector.
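You can sanity-check these shapes in a couple of lines of NumPy. This is just the 3x3x1 example from above, with biases still ignored:

```python
import numpy as np

layer_sizes = [3, 3, 1]  # the 3x3x1 network from the example, biases ignored

# One accumulator per weight matrix: Delta^l has shape
# (nodes in layer l+1) x (nodes in layer l).
Deltas = [np.zeros((layer_sizes[l + 1], layer_sizes[l]))
          for l in range(len(layer_sizes) - 1)]

print([D.shape for D in Deltas])  # [(3, 3), (1, 3)]
```

Each $\Delta^{l}$ has exactly the same shape as the weight matrix $\Theta^{l}$ it is accumulating a gradient for, which is a handy consistency check.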
Why the $\Delta$ is set to all 0 at the start: It's just initialization. You haven't started calculating, or "collecting", the terms of the gradient yet, so you zero out the accumulator before you start.
Significance of the updating: Back to the confusingly repeated $i$. We pass every data point through the neural net, one per iteration of the loop going from $1$ to $m$. So on our first run through the loop, we only accumulate what we think is the gradient based on data point 1, $x^{(1)}$. But who bets the farm on 1 data point? So the next time through, we add the contribution of $x^{(2)}$... and so on, until we get to $x^{(m)}$ and exhaust our data.
But why is this $\Delta$ (after all the calculation) the gradient of the cost function with respect to the parameters?: You're taking the gradient of each sample's error with respect to the parameters (each data point = one iteration through the for loop). By adding them all up (and later dividing by $m$), you're essentially averaging them to get a better estimate of the gradient.
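Here's the accumulate-then-average pattern in isolation, with random stand-in values for the activations and errors (the forward/backward passes that would normally produce them are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                            # number of training examples
a = rng.normal(size=(m, 3))      # stand-in activations a^(l), one row per example
delta = rng.normal(size=(m, 1))  # stand-in errors delta^(l+1), one row per example

Delta = np.zeros((1, 3))  # accumulator, initialized to 0 as in the slide
for i in range(m):        # the loop over examples 1..m
    Delta += np.outer(delta[i], a[i])  # Delta_ij += a^(l)_j * delta^(l+1)_i

D = Delta / m  # average over examples: the gradient estimate

# Same computation, vectorized over all examples at once.
assert np.allclose(D, delta.T @ a / m)
```

The vectorized form at the end is how you'd typically implement it in practice; the explicit loop just mirrors the slide's pseudocode.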
OK, but how are we deciding that adding up $a^{(l)}_j \delta^{(l+1)}_{i}$ turns into a gradient of $J(\Theta)$ (after dividing by $m$)?: This one is tough to type up. To really understand it, I recommend penciling out a baby NN and working through it (doable if you have some, even rusty, calculus background). However, at this stage in the slides, I don't think you're expected to do that. The activation function, which you need to actually do the derivation, isn't even given in the slide. You should be able to google for exercises others have blogged. For example, https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ looks promising, though, full disclosure: I only leafed through it quickly.
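That said, here is the one chain-rule step that makes the product plausible, assuming (as the Coursera notes do) that $\delta^{(l+1)}_i$ is defined as the derivative of the cost with respect to the weighted input $z^{(l+1)}_i = \sum_j \Theta^{(l)}_{ij} a^{(l)}_j$:

$$\frac{\partial J}{\partial \Theta^{(l)}_{ij}} = \frac{\partial J}{\partial z^{(l+1)}_i} \cdot \frac{\partial z^{(l+1)}_i}{\partial \Theta^{(l)}_{ij}} = \delta^{(l+1)}_i \, a^{(l)}_j$$

since $z^{(l+1)}_i$ depends on $\Theta^{(l)}_{ij}$ only through the single term $\Theta^{(l)}_{ij} a^{(l)}_j$. Summing this quantity over the $m$ examples and dividing by $m$ is exactly what the $\Delta$ accumulation does.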
In summary
So after all this work, you have now done backprop once, and the gradient of the cost function with respect to the various parameters is stored in $\Delta^{0}$ through $\Delta^{L-2}$ (that's $L-1$ matrices for an L-layer fully connected NN, using my 0-based indexing).
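Putting all of the pieces together, here is a sketch of one full pass of the algorithm for the 3x3x1 example, assuming sigmoid units and the logistic cost from the Coursera notes (the random weights, inputs, and labels are placeholders, and biases are still ignored as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# Weight matrices for a 3x3x1 net (biases ignored, as in the rest of the answer).
Theta = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
X = rng.normal(size=(5, 3))          # m = 5 made-up examples
y = rng.integers(0, 2, size=(5, 1))  # made-up binary labels

Deltas = [np.zeros_like(T) for T in Theta]  # initialize accumulators to 0
m = len(X)
for x_i, y_i in zip(X, y):           # loop over examples 1..m
    # Forward pass.
    a1 = x_i
    a2 = sigmoid(Theta[0] @ a1)
    a3 = sigmoid(Theta[1] @ a2)
    # Backward pass (logistic cost, sigmoid hidden units).
    d3 = a3 - y_i                           # output-layer error
    d2 = (Theta[1].T @ d3) * a2 * (1 - a2)  # hidden-layer error
    # Accumulate: Delta_ij += a^(l)_j * delta^(l+1)_i.
    Deltas[1] += np.outer(d3, a2)
    Deltas[0] += np.outer(d2, a1)

grads = [D / m for D in Deltas]  # gradients of J w.r.t. Theta[0], Theta[1]
```

A nice way to check a sketch like this is numerical gradient checking: perturb one weight by a tiny epsilon, recompute the cost, and confirm the finite-difference slope matches the corresponding entry of `grads`.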
Also, I did need to refer to https://www.coursera.org/learn/machine-learning/supplement/pjdBA/backpropagation-algorithm to answer this. Note they also assume a specific activation function, and get into the details on later slides.
Finally, I assumed at the start that bias terms don't exist, because the dimensions are easier to see that way. To handle them, you'll need to expand the matrices between each layer to consume the bias term as well, which is the more usual construct.
Best Answer
The difference is that Andrew Ng's video uses the logistic regression cost function, whereas the other source uses the squared-error regression cost function.
The derivation of the delta term using the Logistic regression function is as follows: