Gradient descent minimizes a function using knowledge of its first derivative (the gradient). Newton's method, originally a root-finding algorithm, minimizes a function by finding a root of its first derivative, which requires knowledge of the second derivative (the Hessian). This can be faster when the second derivative is known and easy to compute (for example, the Newton-Raphson algorithm is used to fit logistic regression). However, the analytic expression for the second derivative is often complicated or intractable. And numerical approximation is expensive too: if $N$ values are required to compute the first derivative, on the order of $N^2$ are required for the second derivative.
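To make the contrast concrete, here is a minimal sketch of the two update rules side by side. The toy function and step size are made up purely for illustration:

```python
# Toy 1-D example: f(x) = x^4 + x^2, minimized at x = 0.
# f' and f'' are hand-coded; the function and step size are illustrative.
def f_prime(x):  return 4 * x**3 + 2 * x    # first derivative
def f_second(x): return 12 * x**2 + 2       # second derivative

x_gd, x_newton, alpha = 2.0, 2.0, 0.05
for _ in range(10):
    x_gd     -= alpha * f_prime(x_gd)                    # gradient descent step
    x_newton -= f_prime(x_newton) / f_second(x_newton)   # Newton step on f'
print(x_gd, x_newton)   # Newton ends up far closer to the minimizer x = 0
```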
In many mathematical applications, the motivation becomes clearer after deriving the result. So let's start off with the algebra.
Suppose we run GD for $T$ iterations on the quadratic $f(w) = \frac{1}{2}w^TAw - b^Tw$ from the post, where $A$ is symmetric with eigendecomposition $A = Q\,\mathrm{diag}(\lambda_1, \dots, \lambda_n)\,Q^T$ and the minimizer $w^*$ satisfies $Aw^* = b$. This gives us the sequence $\{w^k\}_{k=0}^T$.
Let's do a change of basis:
$w^k = Qx^k + w^* \iff x^k = Q^T(w^k - w^*)$
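Since $Q$ is orthogonal ($Q^TQ = I$), these two maps really are inverses of each other. A quick numerical sanity check, with a randomly generated symmetric $A$ standing in for the one in the post:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T                      # random symmetric positive-definite A (illustrative)
lam, Q = np.linalg.eigh(A)       # A = Q diag(lam) Q^T with orthogonal Q
b = rng.standard_normal(4)
w_star = np.linalg.solve(A, b)   # minimizer: A w* = b

w = rng.standard_normal(4)       # an arbitrary iterate
x = Q.T @ (w - w_star)           # change of basis
print(np.allclose(Q @ x + w_star, w))  # True: the maps invert each other
```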
Now we have $\{x^k\}_{k=0}^T$. What can we say about them? Let's look at each coordinate separately. Substituting the above into the GD update step $w^{k+1} = w^k - \alpha(Aw^k - b)$ (the gradient of $f$ is $Aw - b$),
$x_i^{k+1}= (Q^T(w^{k+1}-w^*))_i = (Q^T(w^k-\alpha (Aw^k-b)-w^*))_i $
Rearranging,
$x_i^{k+1}=(Q^T(w^k-w^*))_i-\alpha \cdot (Q^T(Aw^k-b))_i$
The first term is exactly $x_i^k$. For the second term, note that $Aw^* = b$, so $Aw^k - b = A(w^k - w^*)$; substituting $A = Q\,\mathrm{diag}(\lambda_1, \dots, \lambda_n)\,Q^T$ and using $Q^TQ = I$ gives $(Q^T(Aw^k - b))_i = \lambda_i (Q^T(w^k - w^*))_i = \lambda_i x_i^k$. This yields
$x_i^{k+1}=x_i^k-\alpha \lambda _i x_i^k=(1-\alpha \lambda _i)x_i^k$
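If you want to convince yourself of this step numerically, here is a small check (again with a random symmetric $A$ and an arbitrary step size $\alpha$, both illustrative) that one GD step in $w$-coordinates is exactly a coordinatewise rescaling by $1 - \alpha\lambda_i$ in the $Q$-basis:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T                      # random symmetric A (illustrative)
lam, Q = np.linalg.eigh(A)
b = rng.standard_normal(4)
w_star = np.linalg.solve(A, b)

alpha = 0.1
w = rng.standard_normal(4)
w_next = w - alpha * (A @ w - b)          # one GD step in w-coordinates

x      = Q.T @ (w - w_star)               # the same step, seen in the Q-basis
x_next = Q.T @ (w_next - w_star)
print(np.allclose(x_next, (1 - alpha * lam) * x))  # True
```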
That was a single step. Unrolling the recursion all the way back to $x^0$, we get
$x_i^{k+1}=(1-\alpha \lambda _i)^{k+1}x_i^0$
All this seems rather useless at this point. Let's go back to our initial concern, the $w$s. From our original change of basis, we know that $w^k-w^*=Qx^k$. Another way of writing the product of the matrix $Q$ and the vector $x^k$ is $\sum_i x_i^kq_i$, where $q_i$ is the $i$-th column of $Q$ (the $i$-th eigenvector of $A$). But we've shown above that $x_i^{k}=(1-\alpha \lambda _i)^{k}x_i^0$. Putting everything together, we obtain the desired "closed form" formula for the GD update step:
$w^k-w^*=\sum_i x_i^0(1-\alpha \lambda _i)^{k} q_i$
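As a sanity check on the whole derivation, the following sketch (random problem instance, hypothetical step-size choice $\alpha = 1/\lambda_{\max}$) runs GD for $k$ iterations and compares the result against the closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M @ M.T                      # random symmetric A (illustrative)
lam, Q = np.linalg.eigh(A)
b = rng.standard_normal(4)
w_star = np.linalg.solve(A, b)

alpha = 1.0 / lam.max()          # a safe step size (hypothetical choice)
w = rng.standard_normal(4)
x0 = Q.T @ (w - w_star)          # initial error in the Q-basis

k = 25
for _ in range(k):               # run GD for k iterations
    w = w - alpha * (A @ w - b)

closed_form = Q @ ((1 - alpha * lam) ** k * x0)   # sum_i x_i^0 (1 - alpha*lam_i)^k q_i
print(np.allclose(w - w_star, closed_form))       # True
```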
This is essentially an expression for the "error" at iteration $k$ of GD (how far we are from the optimal solution, $w^*$). Since we're interested in evaluating the performance of GD, this is the expression we want to analyze. There are two immediate observations. The first is that this term goes to 0 as $k$ goes to infinity, provided the step size satisfies $|1-\alpha \lambda_i| < 1$ for every $i$ (i.e., $0 < \alpha < 2/\lambda_{\max}$), which is of course good news. The second is that the error decomposes very nicely into the separate components of $x^0$, which is even nicer for the sake of our analysis. Here I quote from the original post, since I think they explain it nicely:
Each element of $x^0$ is the component of the error in the initial guess in the $Q$-basis. There are $n$ such errors, and each of these errors follows its own, solitary path to the minimum, decreasing exponentially with a compounding rate of $1-\alpha \lambda_i $. The closer that number is to 1, the slower it converges.
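A tiny illustration of these per-component rates, using a diagonal $A$ (so $Q = I$ and the coordinates are the eigen-components; the eigenvalues and step size are made up for the example):

```python
import numpy as np

lam = np.array([0.1, 1.0, 10.0])   # hypothetical eigenvalues
alpha = 0.09                        # keeps |1 - alpha*lam_i| < 1 for all i
x = np.ones(3)                      # initial error in the Q-basis
for k in range(50):
    x = (1 - alpha * lam) * x       # each component decays at its own rate
print(x)  # the lam = 0.1 component (rate 0.991) has barely moved;
          # the lam = 10 component (rate 0.1) vanished after a few steps
```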
I hope this clears things up for you enough that you can go on to continue reading the post. It's a really good one!
Best Answer
It's not used because it's counterproductive.
Just about the only justification for using gradient descent (and it's really not a good justification at all, as you'll see if you read through some of the posts on this topic on this site) is that it avoids calculating the Hessian, which can be very expensive for high-dimensional problems. So once you've calculated the Hessian, you've given up gradient descent's one strength: not needing the Hessian.
If you have already calculated the Hessian, you're better off using something like Newton's method.
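For a quadratic like the one analyzed above, this is particularly stark: the Hessian is just $A$, and a single Newton step lands exactly on the minimizer. A minimal sketch (random instance, for illustration):

```python
import numpy as np

# Once the Hessian H is available, a Newton step solves H p = -grad.
# On f(w) = 0.5 w^T A w - b^T w the Hessian is A itself.
rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = M @ M.T                      # Hessian of the quadratic (illustrative)
b = rng.standard_normal(5)

w = rng.standard_normal(5)
grad = A @ w - b
w_newton = w + np.linalg.solve(A, -grad)   # Newton step: w - H^{-1} grad
print(np.allclose(A @ w_newton, b))        # True: stationary in one step
```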