The backpropagation algorithm IS gradient descent, and the reason it is usually restricted to first derivatives (instead of Newton's method, which requires the Hessian) is that applying the chain rule to the first derivative is exactly what gives us the "back propagation" in the backpropagation algorithm. Newton's method is problematic (complex and hard to compute), but that does not stop us from using quasi-Newton methods, especially BFGS (I believe many neural network software packages already use BFGS as part of their training these days).
As for the fixed learning rate, it need not be fixed at all. There are papers as far back as '95 reporting on this (search for "adaptive learning rate backpropagation").
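To make the adaptive-rate idea concrete, here is a minimal sketch of one classic heuristic (sometimes called the "bold driver"): grow the rate after a step that decreases the loss, shrink it and retry after a step that increases it. The toy quadratic loss and the growth/shrink factors are illustrative choices, not from any particular paper or package.

```python
# Adaptive learning rate via the "bold driver" heuristic on a toy
# one-dimensional quadratic loss. All names and constants are illustrative.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.5
prev = loss(w)
for _ in range(100):
    step = w - lr * grad(w)
    cur = loss(step)
    if cur <= prev:      # improvement: accept the step, grow the rate
        w, prev, lr = step, cur, lr * 1.1
    else:                # overshoot: reject the step, shrink the rate
        lr *= 0.5

print(w)  # converges to the minimizer w = 3
```

The same accept/reject logic transfers directly to a full parameter vector; only the loss and gradient change.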
As much as I love Clifford / geometric algebra, I don't think it is of use here. CA allows you to deal in a basis-free way with multidimensional subspaces as algebraic objects, making various complicated derivations more elegant and transparent. In the case of neural nets, however, we really are dealing with a plain vector of parameters. If $\theta \in \mathbb{R}^D$ is a list of parameters, it is indeed a vector because the sum of two parameter vectors is again a valid parameter vector and a scalar times a parameter vector is also a valid parameter vector. This vector does not naturally possess any additional structure that would make it into a multivector.
However, standard Linear Algebra, as practiced by mathematicians, does allow you to do basis-free computations with vectors. This can indeed make the derivation of backpropagation more transparent.
The essence of backprop
Basically, the backpropagation algorithm is an efficient way to compute the gradient of a function that is the composition of a number of functions:
$$ L_\theta(x) = (f_N \circ f_{N-1} \circ \cdots \circ f_1)(x)$$
$L$ is called the loss; it is a chain of functions $f_i : \mathbb{R}^{D_{i-1}} \times \mathbb{R}^{P_i} \rightarrow \mathbb{R}^{D_i}$, each taking a $P_i$-dimensional parameter vector $\theta_i \in \mathbb{R}^{P_i}$ and the output $f_{i-1} \in \mathbb{R}^{D_{i-1}}$ of the previous layer.
Without choosing a basis, we see (using the total derivative) that the gradient of $L$ with respect to the whole parameter vector $\theta$ is
$$
\begin{equation}
\begin{aligned}
\frac{d}{d \theta} f_N \circ \cdots \circ f_1 (x) \;
&= \frac{\partial f_N}{\partial \theta_N} \frac{d \theta_N}{d \theta} + \frac{\partial f_N}{\partial f_{N-1}} \frac{d f_{N-1}}{d \theta} \\
&=
\frac{\partial f_N}{\partial \theta_N} \frac{d \theta_N}{d \theta}
+ \frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial \theta_{N-1}} \frac{d \theta_{N-1}}{d\theta}
+
\frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}} \frac{d f_{N-2}}{d\theta}\\
&= \cdots
\end{aligned}
\end{equation}
$$
Do you see the pattern developing? You get products of the form $\frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}} \cdots \frac{\partial f_{2}}{\partial f_1}$. Each factor in that product requires the output of the previous function (layer), so a naive algorithm would recompute $f_i$ for each term that requires it. A better approach is to do a "forward pass" where we compute $f_1(x)$, remember it, then use it to compute $f_2(f_1(x))$, remember it, and so on.
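The forward pass with caching can be sketched as follows. The choice of $\tanh$ layers and the layer sizes are illustrative assumptions; the point is only that each $f_i$ is computed once and its output stored for later use.

```python
import numpy as np

# Forward pass with caching: compute each layer once and store its output,
# so the backward pass can reuse activations instead of recomputing f_i
# from scratch for every term. Layer shapes and tanh are illustrative.

rng = np.random.default_rng(0)
dims = [4, 8, 6, 1]                       # D_0, D_1, D_2, D_3
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

def forward(x, Ws):
    acts = [x]                            # cache f_0 = x, f_1, ..., f_N
    for W in Ws:
        acts.append(np.tanh(W @ acts[-1]))
    return acts

acts = forward(rng.normal(size=dims[0]), Ws)
print([a.shape for a in acts])            # [(4,), (8,), (6,), (1,)]
```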
Then there is the question of how to compute $\frac{\partial f_N}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}} \cdots \frac{\partial f_{2}}{\partial f_1}$ once we have all the values of $f_i$. Once we choose a basis (and we eventually have to in order to get numerical values), the factors $\frac{\partial f_{i}}{\partial f_{i-1}}$ are matrices, so we're talking about a big matrix multiplication. For example, if a mid-layer maps $\mathbb{R}^{1000}$ to $\mathbb{R}^{5000}$, then $\frac{\partial f_{i}}{\partial f_{i-1}}$ is a $5000 \times 1000$ matrix. The key insight of backprop is that the whole function $L$ has signature $L : \mathbb{R}^{D_0} \rightarrow \mathbb{R}$, so it's much better to compute
$$
\left(\left(\left( \frac{\partial f_N}{\partial f_{N-1}} \cdots \frac{\partial f_{3}}{\partial f_2} \right) \frac{\partial f_{2}}{\partial f_1} \right) \frac{d f_1}{d \theta}\right)
$$
than
$$
\left(\frac{\partial f_N}{\partial f_{N-1}} \cdots \left(\frac{\partial f_{3}}{\partial f_2} \left(\frac{\partial f_{2}}{\partial f_1} \frac{d f_1}{d \theta}\right)\right)\right)
$$
The reason is that in the first case we perform a sequence of vector-matrix multiplications. Since $L$ is scalar-valued, $\frac{\partial f_N}{\partial f_{N-1}}$ is a row vector, so the product $\frac{\partial f_{N}}{\partial f_{N-1}} \frac{\partial f_{N-1}}{\partial f_{N-2}}$ is again a row vector, which we then multiply by $\frac{\partial f_{N-2}}{\partial f_{N-3}}$, and so on. The cost of each such multiplication is $O(D^2)$. In the second case, one computes matrix-matrix products, each of which costs $O(D^3)$.
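The two orderings can be checked numerically: they give the same answer, but with the shapes above, only the left-to-right ordering keeps every intermediate result a vector. The random Jacobians below are stand-ins for the $\frac{\partial f_i}{\partial f_{i-1}}$ factors, and $v$ plays the role of the $1 \times D$ row vector $\frac{\partial f_N}{\partial f_{N-1}}$.

```python
import numpy as np

# Why multiplication order matters: left-to-right keeps every intermediate
# a row vector (O(D^2) per factor), while right-to-left multiplies full
# matrices first (O(D^3) per factor). The values agree either way.

rng = np.random.default_rng(1)
D = 50
J3, J2, J1 = (rng.normal(size=(D, D)) for _ in range(3))
v = rng.normal(size=(1, D))           # row vector: gradient of scalar L

left_to_right = ((v @ J3) @ J2) @ J1  # three vector-matrix products
right_to_left = v @ (J3 @ (J2 @ J1))  # two matrix-matrix products first

print(np.allclose(left_to_right, right_to_left))  # same value, very
                                                  # different cost
```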
So we have established that backprop is the exploitation of associativity. In this light, backprop seems like a trivial algorithm (compute the product in the right order), but back in the '80s you could write a paper about it and get very famous :)
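Putting the forward caching and the right-to-left vector-matrix ordering together gives the whole algorithm. The sketch below is one concrete instantiation under illustrative assumptions: layers $f_i = \tanh(W_i\, f_{i-1})$ and a scalar loss $L = \sum_j (f_N)_j$; the result is checked against a finite-difference gradient.

```python
import numpy as np

# Backprop end to end: a cached forward pass, then the chain-rule product
# accumulated right to left so every intermediate is a vector. Layers are
# tanh(W_i @ a_{i-1}); the loss is the sum of the last layer's outputs.

rng = np.random.default_rng(2)
dims = [3, 5, 4, 2]
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
x = rng.normal(size=dims[0])

def forward(x, Ws):
    acts = [x]
    for W in Ws:
        acts.append(np.tanh(W @ acts[-1]))
    return acts

def backward(acts, Ws):
    grads = [None] * len(Ws)
    g = np.ones_like(acts[-1])            # dL/df_N for L = sum(f_N)
    for i in reversed(range(len(Ws))):
        g = g * (1.0 - acts[i + 1] ** 2)  # through tanh: f' = 1 - tanh^2
        grads[i] = np.outer(g, acts[i])   # dL/dW_i
        g = Ws[i].T @ g                   # vector-matrix: dL/df_{i-1}
    return grads

grads = backward(forward(x, Ws), Ws)

# Sanity check: compare one entry against a finite difference.
def L(Ws):
    return forward(x, Ws)[-1].sum()

eps = 1e-6
Wp = [W.copy() for W in Ws]
Wp[0][0, 0] += eps
fd = (L(Wp) - L(Ws)) / eps
print(abs(fd - grads[0][0, 0]) < 1e-4)    # True
```

Note that the backward loop never forms a full Jacobian matrix: it only ever applies $W_i^\top$ to a vector, which is exactly the left-to-right ordering from above.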
It works because we combine three things: the chain rule, differentiable elementary functions, and the fact that a neural network is a series of linear combinations of those functions. By linearity of the differentiation operation, we can apply the chain rule to such linear combinations.