It's not used because it's counterproductive.

Just about the *only* justification for using gradient descent (and it's really not a good justification at all, as you will see if you read through some of the posts on the topic on this site) is that it avoids calculating the Hessian, which can be very expensive for high-dimensional problems. So once you've calculated the Hessian, you've thrown away gradient descent's one strength: not needing the Hessian.

If you have calculated the Hessian, you're better off using something like Newton's method.
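To make the comparison concrete, here is a minimal sketch in Python. The quadratic `f(x, y) = x**2 + 10*y**2`, the starting point, and the learning rate are all illustrative assumptions, not anything from the question. On a quadratic like this, a single Newton step lands on the minimizer, while gradient descent with a fixed step is still some way off after ten iterations.

```python
def grad(x, y):
    # Gradient of the illustrative function f(x, y) = x**2 + 10*y**2
    return (2.0 * x, 20.0 * y)


def hessian(x, y):
    # Hessian of f; constant here because f is quadratic
    return ((2.0, 0.0), (0.0, 20.0))


def gradient_descent_step(x, y, lr=0.05):
    # Uses only the gradient; lr is an assumed fixed step size
    gx, gy = grad(x, y)
    return (x - lr * gx, y - lr * gy)


def newton_step(x, y):
    # Uses the Hessian: solve H * s = g, then move to (x, y) - s
    gx, gy = grad(x, y)
    (a, b), (c, d) = hessian(x, y)
    det = a * d - b * c
    sx = (d * gx - b * gy) / det
    sy = (-c * gx + a * gy) / det
    return (x - sx, y - sy)


start = (3.0, 1.0)

# One Newton step from the starting point
newton_point = newton_step(*start)

# Ten gradient descent steps from the same starting point
gd_point = start
for _ in range(10):
    gd_point = gradient_descent_step(*gd_point)

print("Newton after 1 step:", newton_point)
print("Gradient descent after 10 steps:", gd_point)
```

For a general (non-quadratic) function Newton's method needs more than one step, and in higher dimensions you would solve the linear system with a proper solver rather than the explicit 2x2 inverse used here.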

There are several options available to you:

1. Compute the derivatives by hand and then implement them in code.
2. Use a symbolic computation package like Maple, Mathematica, Wolfram Alpha, etc. to find the derivatives. Some of these packages will translate the resulting formulas directly into code.
3. Use an automatic differentiation (AD) tool that takes a program for computing the cost function and (using compiler-like techniques) produces a program that computes the derivatives as well as the cost function.
4. Use finite difference formulas to approximate the derivatives.

For anything other than the simplest problems (like ordinary least squares), option 1 is a poor choice. Most experts on optimization will tell you that it is very common for users of optimization software to supply incorrect derivative formulas to optimization routines. This typically leads to slow convergence or no convergence at all.

Option 2 is a good one for most relatively simple cost functions. It doesn't require really exotic tools.

Option 3 really shines when the cost function is the result of a fairly complicated function for which you have the source code. However, AD tools are specialized and not many users of optimization software are familiar with them.
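As a rough illustration of the idea behind AD, here is a toy forward-mode sketch in Python using dual numbers. Note that this uses operator overloading rather than the source-transformation approach described above, and the `Dual` class, the `sin` wrapper, and the `cost` function are all hypothetical names made up for this example. Real AD tools are far more complete.

```python
import math


class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        # Treat plain numbers as constants (derivative 0)
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def sin(x):
    x = Dual.lift(x)
    # Chain rule for sin
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def cost(x, y):
    # Hypothetical cost function, written as an ordinary program
    return x * x * y + sin(x * y)


def gradient(f, x, y):
    # One forward pass per input: seed that input's derivative with 1
    df_dx = f(Dual(x, 1.0), Dual(y, 0.0)).deriv
    df_dy = f(Dual(x, 0.0), Dual(y, 1.0)).deriv
    return df_dx, df_dy


gx, gy = gradient(cost, 1.5, 2.0)
print(gx, gy)
```

The point is that `cost` is never differentiated by hand: the derivative is propagated through each arithmetic operation as the program runs, so the result is exact up to floating-point rounding.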

Option 4 is sometimes a necessary choice. If you have a "black box" function that you can't get source code for (or that is so badly written that AD tools can't handle it), finite difference approximations can save the day. However, using finite difference approximations has a significant cost in run time and in the accuracy of the derivatives and ultimately the solutions obtained.
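A minimal sketch of option 4 in Python, assuming a central-difference formula and a step size of `1e-6` (both are choices made for this example). The `black_box` function is a stand-in whose exact gradient happens to be known, so the approximation error can be inspected.

```python
import math


def black_box(x):
    # Stand-in for a function whose source is unavailable (hypothetical)
    return math.exp(x[0]) * math.sin(x[1])


def fd_gradient(f, x, h=1e-6):
    """Central-difference gradient: two evaluations of f per coordinate."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2.0 * h))
    return grad


point = [0.5, 1.2]
approx = fd_gradient(black_box, point)

# Exact gradient, available here only because the stand-in is simple
exact = [math.exp(0.5) * math.sin(1.2), math.exp(0.5) * math.cos(1.2)]
errors = [abs(a - e) for a, e in zip(approx, exact)]
print(approx, errors)
```

Both costs mentioned above are visible here: the gradient takes `2 * len(x)` function evaluations, and the result is only approximate. Making `h` smaller reduces the truncation error but amplifies floating-point rounding error, so the step size cannot simply be driven to zero.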

For most machine learning applications, options 1 and 2 are perfectly adequate. The loss functions (least squares, logistic regression, etc.) and penalties (one-norm, two-norm, elastic net, etc.) are simple enough that the derivatives are easy to find.

Options 3 and 4 come into play more often in engineering optimization where the objective functions are more complicated.

## Best Answer

The gradient is the vector of partial derivatives:

$$\nabla f = \left(\frac{\partial f}{\partial x_1};\frac{\partial f}{\partial x_2};\dots;\frac{\partial f}{\partial x_n}\right)$$

For example, with $f = x^2 y$:

$$\nabla f =(2xy;x^2)$$

The gradient gives the rate of change of $f$ in any direction $e$ (where $e$ is a unit vector) through the dot product $\nabla f \cdot e$:

For example, $\nabla f \cdot (0;1) = \frac{\partial f}{\partial y}$
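The same thing can be checked numerically. This is a small Python sketch using the example $f = x^2 y$ from above; the evaluation point `(3, 2)`, the diagonal direction, and the step `t` are arbitrary choices for illustration.

```python
import math


def f(x, y):
    return x ** 2 * y


def grad_f(x, y):
    # Gradient from the worked example: (2xy; x^2)
    return (2.0 * x * y, x ** 2)


def directional_derivative(x, y, e):
    # Dot product of the gradient with the unit vector e
    gx, gy = grad_f(x, y)
    return gx * e[0] + gy * e[1]


x0, y0 = 3.0, 2.0

# Along e = (0; 1) the dot product picks out df/dy = x0**2
along_y = directional_derivative(x0, y0, (0.0, 1.0))

# Along the unit diagonal, both partial derivatives contribute
diagonal = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
along_diagonal = directional_derivative(x0, y0, diagonal)

# Numerical check: slope of f along the diagonal over a small step t
t = 1e-6
slope = (f(x0 + t * diagonal[0], y0 + t * diagonal[1])
         - f(x0 - t * diagonal[0], y0 - t * diagonal[1])) / (2.0 * t)

print(along_y, along_diagonal, slope)
```

The measured slope along the diagonal should agree with the dot product up to a small numerical error, which is what "rate of change in direction $e$" means in practice.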