[Math] Why don’t people do simulated annealing before gradient descent?

Tags: gradient-descent, optimization

It seems obvious to me to first explore the optimization landscape widely (this is effectively what simulated annealing does) and get a sense of the problem structure. Only then, after finding which hill to climb, would one perform gradient descent. Why isn't this done more often?
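To make the idea concrete, here is a minimal sketch of the anneal-then-descend approach I have in mind, on a toy one-dimensional objective. Everything here (the `rugged_loss` function, the cooling schedule, the step sizes) is an illustrative choice of mine, not a recommendation:

```python
import numpy as np

rng = np.random.default_rng(0)

def rugged_loss(x):
    # Toy multimodal objective: many local minima, global minimum at x = 0.
    return x**2 + 10.0 * (1.0 - np.cos(2.0 * np.pi * x))

def grad_rugged_loss(x):
    # Analytic gradient of the toy objective.
    return 2.0 * x + 20.0 * np.pi * np.sin(2.0 * np.pi * x)

# Phase 1: simulated annealing to explore widely and pick a promising basin.
x = rng.uniform(-10.0, 10.0)
f_x = rugged_loss(x)
best_x, best_f = x, f_x
temp = 5.0
for _ in range(2000):
    proposal = x + rng.normal(scale=1.0)
    f_prop = rugged_loss(proposal)
    # Always accept downhill moves; accept uphill moves with Boltzmann probability.
    if f_prop < f_x or rng.random() < np.exp((f_x - f_prop) / temp):
        x, f_x = proposal, f_prop
        if f_x < best_f:
            best_x, best_f = x, f_x
    temp *= 0.995  # geometric cooling

# Phase 2: plain gradient descent to climb down the basin found above.
x = best_x
for _ in range(500):
    x -= 0.001 * grad_rugged_loss(x)

print(f"annealing ended near x = {best_x:.3f}; gradient descent refined it to x = {x:.3f}")
```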

Best Answer

Take deep learning as an example: the number of parameters is so huge (millions, and often far more) that simulated annealing may take longer than just doing gradient descent from whatever (random) initial state your weights are currently in. Each backpropagation pass gives you a full descent direction over all parameters at once, while an annealing move is essentially a random perturbation whose chance of helping shrinks as the dimension grows.

So, in the case of deep learning, it doesn't make (economic) sense to run simulated annealing first.
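A quick way to see the dimensionality problem (my own toy illustration, not part of the original answer): on a simple quadratic bowl, one gradient step exploits a full d-dimensional descent direction, while a random perturbation of the same length, the kind of move annealing relies on, makes progress that shrinks as d grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def compare(d, n_trials=1000, rel_step=0.01):
    """On f(x) = ||x||^2 / 2, compare the relative loss reduction of one
    gradient step against random moves of the same length."""
    x = rng.normal(size=d)
    f0 = 0.5 * x @ x
    length = rel_step * np.linalg.norm(x)

    # Gradient step: moves exactly along -x, so the decrease is guaranteed.
    x_gd = x - length * x / np.linalg.norm(x)
    gd_gain = (f0 - 0.5 * x_gd @ x_gd) / f0

    # Random moves: only about half go downhill, and by less as d grows.
    gains = np.empty(n_trials)
    for i in range(n_trials):
        u = rng.normal(size=d)
        u *= length / np.linalg.norm(u)
        gains[i] = (f0 - 0.5 * (x + u) @ (x + u)) / f0
    # Expected progress per proposal if uphill moves are rejected.
    sa_gain = np.mean(np.maximum(gains, 0.0))
    return gd_gain, sa_gain

for d in (10, 100, 10_000):
    gd, sa = compare(d)
    print(f"d={d:6d}  gradient step: {gd:.5f}  random move (accepted only): {sa:.5f}")
```

With millions of parameters, d is enormous, so each annealing proposal buys far less progress than a gradient step of comparable cost.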