In Andrew Ng's machine learning class, he mentions that feature scaling will make gradient descent go faster.
Specifically:
We can speed up gradient descent by having each of our input values in
roughly the same range. This is because θ will descend quickly on
small ranges and slowly on large ranges, and so will oscillate
inefficiently down to the optimum when the variables are very uneven.
Why would it work?
Best Answer
Gradient descent uses one fixed learning rate for all of the $\theta$'s, so the learning rate has to be chosen to suit the feature with the smallest range; otherwise gradient descent may fail to converge along that dimension. But with a learning rate that small, it takes ages for the $\theta$ associated with the large-range feature to converge. Scaling all features to a similar range lets a single learning rate work well for every dimension.
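A minimal sketch of the effect, assuming a toy linear-regression setup (the feature ranges, learning rates, and iteration counts below are illustrative choices, not values from the course):

```python
import numpy as np

# Synthetic linear regression with two features of very different ranges
# (illustrative assumption: one feature in [0, 1], one in [0, 1000]).
rng = np.random.default_rng(0)
m = 200
x1 = rng.uniform(0, 1, m)        # small-range feature
x2 = rng.uniform(0, 1000, m)     # large-range feature
X = np.column_stack([np.ones(m), x1, x2])
y = 3 + 2 * x1 + 0.05 * x2 + rng.normal(0, 0.1, m)

def gradient_descent(X, y, alpha, iters):
    # Batch gradient descent on the usual squared-error cost,
    # with one fixed learning rate alpha for every theta.
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n
        theta -= alpha * grad
    return theta

def cost(X, y, theta):
    return np.mean((X @ theta - y) ** 2) / 2

# Without scaling: the very different feature ranges force a tiny learning
# rate (a much larger one makes the updates blow up), and after 5000
# iterations the other parameters have barely moved.
theta_raw = gradient_descent(X, y, alpha=1e-6, iters=5000)
print("unscaled, alpha=1e-6 :", cost(X, y, theta_raw))

# With scaling (standardize each feature): a single moderate learning rate
# now works well in every dimension, and the cost drops to near the noise level.
X_scaled = X.copy()
X_scaled[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
theta_scaled = gradient_descent(X_scaled, y, alpha=0.1, iters=5000)
print("scaled,   alpha=0.1  :", cost(X_scaled, y, theta_scaled))
```

Standardization (subtract the mean, divide by the standard deviation) is used here; dividing by the range (max − min), as mean normalization in the course does, has a similar effect.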
There is also a good explanation on Quora.