[Math] Feature scaling’s effect on gradient descent

Tags: gradient-descent, machine-learning

In Andrew Ng's machine learning class, he mentions that feature scaling makes gradient descent converge faster.

https://www.coursera.org/learn/machine-learning/supplement/CTA0D/gradient-descent-in-practice-i-feature-scaling

Specifically:

We can speed up gradient descent by having each of our input values in
roughly the same range. This is because θ will descend quickly on
small ranges and slowly on large ranges, and so will oscillate
inefficiently down to the optimum when the variables are very uneven.

Why would it work?

Best Answer

Gradient descent uses one fixed learning rate for every component of $\theta$, but the cost surface is much steeper along the parameter of a large-range feature than along the parameter of a small-range one. The learning rate therefore has to be chosen small enough for the steepest direction, i.e. based on the feature with the largest range; otherwise the updates overshoot and oscillate in that direction instead of converging. With a learning rate that small, the parameter of the small-range feature descends very slowly, so it takes ages to converge.
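To see the effect concretely, here is a minimal NumPy sketch (the data, feature names, and iteration budget are made up for illustration). It runs plain batch gradient descent with the largest fixed step that is stable in every direction, once on the raw features and once on z-scored features, and prints the Hessian's condition number together with the cost after the same number of steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data, made up for illustration: feature 1 spans ~[500, 3500]
# (say, house size in square feet), feature 2 spans ~[1, 5] (bedrooms).
m = 200
size = rng.uniform(500.0, 3500.0, m)
rooms = rng.uniform(1.0, 5.0, m)
y = 0.3 * size + 50.0 * rooms + rng.normal(0.0, 1.0, m)

def add_bias(F):
    """Prepend a column of ones so the model has an intercept."""
    return np.column_stack([np.ones(len(F)), F])

def run_gd(X, y, n_iter=1000):
    """Batch gradient descent on J = 1/(2m) * ||X @ theta - y||^2, using the
    largest fixed step that is stable in every direction: 1 / lambda_max(H)."""
    m = len(y)
    H = X.T @ X / m                            # constant Hessian of J
    alpha = 1.0 / np.linalg.eigvalsh(H)[-1]    # eigvalsh sorts ascending
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta -= alpha * X.T @ (X @ theta - y) / m
    cost = np.sum((X @ theta - y) ** 2) / (2 * m)
    return cost, np.linalg.cond(H)

features = np.column_stack([size, rooms])
# z-score each feature: subtract the mean, divide by the standard deviation
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

for name, F in [("raw", features), ("z-scored", scaled)]:
    cost, kappa = run_gd(add_bias(F), y)
    print(f"{name:9s} cond(H) = {kappa:9.2e}   cost after 1000 steps = {cost:.4g}")
```

With this setup the raw run is still orders of magnitude above the noise floor after 1000 steps, while the z-scored run, whose condition number is close to 1, reaches it almost immediately.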

There is also a good explanation on Quora:

Essentially, scaling the inputs (through mean normalization, or z-score) gives the error surface a more spherical shape, where it would otherwise be a very high curvature ellipse. Since gradient descent is curvature-ignorant, having an error surface with high curvature will mean that we take many steps which aren't necessarily in the optimal direction. When we scale the inputs, we reduce the curvature, which makes methods that ignore curvature (like gradient descent) work much better. When the error surface is circular (spherical), the gradient points right at the minimum, so learning is easy.
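To make the curvature argument quantitative, consider the linear-regression case as a sketch. The cost $J(\theta) = \frac{1}{2m}\lVert X\theta - y\rVert^2$ has constant Hessian $H = \frac{1}{m}X^\top X$, and a fixed-step gradient descent update contracts the error $e_k = \theta_k - \theta^*$ as

$$e_{k+1} = (I - \alpha H)\,e_k,$$

which is stable only when $\alpha < 2/\lambda_{\max}(H)$. With the best admissible step, the slowest eigendirection shrinks per iteration by roughly

$$1 - \frac{\lambda_{\min}(H)}{\lambda_{\max}(H)} = 1 - \frac{1}{\kappa(H)},$$

so the iteration count grows with the condition number $\kappa(H)$. Z-scoring the features drives $H$ toward the identity, $\kappa(H) \approx 1$: exactly the "spherical" error surface described above.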
