It seems obvious to me to first widely explore the optimization landscape (this is effectively what simulated annealing does) and get a sense of the problem structure. Only then, after finding which hill to climb, perform gradient descent. Why isn't this done more often?
[Math] Why don’t people do simulated annealing before gradient descent?
gradient descent, optimization
Best Answer
Take deep learning as an example: the number of parameters (often in the millions) is so large that simulated annealing may take longer than simply running gradient descent from whatever (random) initial state the weights are in.
So, in the case of deep learning, it doesn't make (economic) sense to run simulated annealing first.
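
For concreteness, here is a minimal sketch of the two-stage scheme from the question (anneal first, then descend) on a toy one-dimensional problem. Everything in it is an assumption made for illustration: the objective `f`, the linear cooling schedule, the Gaussian proposal, and all step counts and step sizes are hypothetical choices, not a recommended recipe.

```python
import math
import random


def f(x):
    # Hypothetical 1-D objective with several local minima (illustration only).
    return 0.1 * x * x + math.sin(3.0 * x)


def grad_f(x):
    # Analytic derivative of f.
    return 0.2 * x + 3.0 * math.cos(3.0 * x)


def simulated_annealing(x0, steps=2000, t0=2.0, step_size=1.0, seed=0):
    # Stage 1: broad, gradient-free exploration. Returns the best point seen.
    rng = random.Random(seed)
    x, best = x0, x0
    for k in range(steps):
        # Assumed linear cooling schedule; the small offset keeps t positive.
        t = t0 * (1.0 - k / steps) + 1e-9
        cand = x + rng.gauss(0.0, step_size)
        delta = f(cand) - f(x)
        # Metropolis rule: always accept improvements, and accept worse
        # moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = cand
            if f(x) < f(best):
                best = x
    return best


def gradient_descent(x0, lr=0.01, steps=500):
    # Stage 2: local refinement from the point chosen by annealing.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_sa = simulated_annealing(x0=8.0)
x_gd = gradient_descent(x_sa)
print(f"after annealing: x={x_sa:.4f}, f={f(x_sa):.4f}")
print(f"after descent:   x={x_gd:.4f}, f={f(x_gd):.4f}")
```

In one dimension the annealing stage is cheap. Each annealing step still costs one evaluation of the objective, though, and with millions of parameters a random proposal explores only a tiny part of the space, which is where the cost argument above comes in.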