Solved – Why is gradient descent so bad at optimizing polynomial regression

Tags: gradient descent, machine learning, regression, scikit-learn, statsmodels

As part of a self-study exercise, I am comparing various implementations of polynomial regression:

Closed form solution
Gradient descent with Numpy
Scipy optimize
Sklearn
Statsmodels

When the problem involves polynomials of degree 3 or less, there is no problem: all five approaches yield the same coefficients. However, when the degree increases to 5, 10 or even 15, I find it impossible to reach the correct minimum with my numpy and scipy.optimize implementations.

Question:

Why is gradient descent, and to a certain extent the scipy.optimize algorithm, so bad at optimizing polynomial regression?

Is this because the cost function is non-convex? Not smooth? Due to numerical instability or collinearity?

Example

In my model, there is only one variable and the design matrix has the columns $1, x, x^2, x^3, \dots, x^n$. The data is based on a sine function with uniform noise.

import numpy as np

# Initialize noisy non-linear data
x = np.linspace(0, 1, 40)
noise = np.random.uniform(size=40)  # uniform noise on [0, 1)
y = np.sin(x * 1.5 * np.pi)
y_noise = (y + noise - 1).reshape(-1, 1)  # shift noise to [-1, 0), column vector
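
For reference, the design matrix described above can be built and fitted in a few lines. This is only a minimal sketch, not the code linked in the question: the helper name `design_matrix`, the fixed seed, and the use of `np.linalg.lstsq` in place of an explicit inverse are assumptions made for illustration.

```python
import numpy as np

def design_matrix(x, degree):
    """Columns 1, x, x^2, ..., x^degree (hypothetical helper)."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
x = np.linspace(0, 1, 40)
y_noise = np.sin(1.5 * np.pi * x) + rng.uniform(size=40) - 1

X = design_matrix(x, degree=3)
# lstsq minimizes ||X b - y||^2 without forming (X^T X)^{-1} explicitly
beta, *_ = np.linalg.lstsq(X, y_noise, rcond=None)
print(beta)
```

The coefficients will differ from the ones quoted in the question because the noise is random.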

Polynomial order 3

Closed form solution: $(X^TX)^{-1}X^Ty = \begin{bmatrix} 0.07 & 10.14 & -20.15 & 9.1 \end{bmatrix}$
Numpy gradient descent: same coefficients with 50,000 iterations and step size = 1
Scipy optimize: same coefficients using the BFGS method and the first derivative (gradient)
Sklearn: same coefficients
Statsmodels: same coefficients

Polynomial order 5

Closed form solution: $(X^TX)^{-1}X^Ty = \begin{bmatrix} 0.65 & 5.82 & -17.82 & 29.10 & -35.25 & 17.08 \end{bmatrix}$
Numpy gradient descent: smaller coefficients with 50,000 iterations and step size = 1: $\begin{bmatrix} 0.71 & 3.98 & -5.2 & -3.23 & -0.08 & 3.44 \end{bmatrix}$
Scipy optimize: also smaller coefficients, of the same order as with the Numpy implementation, using the BFGS method and the first derivative (gradient): $\begin{bmatrix} 0.70 & 4.14 & -5.83 & -2.73 & 0.18 & 3.09 \end{bmatrix}$
Sklearn: same as analytical solution
Statsmodels: same as analytical solution

Polynomial order 16+

All methods give different results.

As the question is quite long already, you'll find the code here

Best Answer

Is this because the cost function is non convex ? Not smooth ? Due to numerical instability or collinearity ?

This is ordinary linear regression (the model is linear in the coefficients, even though it is polynomial in $x$) with a sum-of-squares loss function. That loss function is both convex and continuously differentiable (smooth), and if you are able to obtain a closed form solution (i.e. $X^TX$ is invertible) it is strictly convex with a unique minimum. (1, 2)
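
To spell out why, writing the loss as $L(\beta)$ (notation assumed here, with $\beta$ the coefficient vector):

$$L(\beta) = \lVert X\beta - y \rVert_2^2, \qquad \nabla L(\beta) = 2X^T(X\beta - y), \qquad \nabla^2 L(\beta) = 2X^TX.$$

For any vector $v$, $v^T X^T X v = \lVert Xv \rVert_2^2 \ge 0$, so the Hessian is positive semidefinite and $L$ is convex. Setting $\nabla L(\beta) = 0$ gives the normal equations $X^TX\beta = X^Ty$, which have the unique solution $\beta = (X^TX)^{-1}X^Ty$ exactly when $X^TX$ is invertible.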

Why is gradient descent, and to a certain extent the scipy.optimize algorithm, so bad at optimizing polynomial regression?

Gradient descent is known to be both slow (compared to second-derivative methods) and sensitive to step size. I also want to second what @Sycorax and @Jonny Lomond put in the comments: this particular problem is a difficult one for GD because of the massive magnitude difference across your dimensions (the columns $x, x^2, \dots, x^n$ are also strongly collinear), and your closed form solution may be numerically unstable as well. This link has some really fantastic material on optimization challenges and momentum-based solutions, including a polynomial regression example.
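
One way to see how hard the problem gets is to look at the condition number $\kappa$ of $X^TX$, which is the Hessian of the sum-of-squares loss up to a constant factor. For a quadratic loss, gradient descent with the best fixed step size shrinks the error by a factor of about $(\kappa - 1)/(\kappa + 1)$ per iteration, so a huge $\kappa$ means very slow progress. The sketch below assumes the same grid of 40 points on $[0, 1]$ as the question; the variable names are illustrative.

```python
import numpy as np

x = np.linspace(0, 1, 40)
kappas = {}
for degree in (3, 5, 10, 15):
    X = np.vander(x, N=degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    # Condition number of X^T X (ratio of largest to smallest singular value)
    kappas[degree] = np.linalg.cond(X.T @ X)
    print(f"degree {degree:2d}: cond(X^T X) = {kappas[degree]:.2e}")
```

At the higher degrees the printed values approach or exceed what double precision can resolve, which is also why explicitly inverting $X^TX$ becomes unreliable there.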

A few approaches you might consider:

As @Jonny Lomond suggested, standardize each polynomial term (each column of the design matrix) separately, or tune your step size.
Plot your loss function over iterations to determine if there are any obvious problems with your optimization. If your gradient is "overshooting", you could try using an adaptive step size (reducing it as a function of the number of iterations).
Use backtracking to dynamically determine a better step size at each iteration.
Use a momentum-based gradient method like Nesterov accelerated gradient descent. In practice these approaches can converge almost as fast as second-order methods.
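
The backtracking and momentum suggestions could look roughly like the sketch below for the degree-5 problem. It is an illustration under assumptions, not the asker's code: the function names, the seed, the $1/(2n)$ scaling of the loss, the Armijo constants, and the iteration counts are all chosen here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption, for reproducibility
x = np.linspace(0, 1, 40)
y = np.sin(1.5 * np.pi * x) + rng.uniform(size=40) - 1
X = np.vander(x, N=6, increasing=True)  # degree-5 design matrix
n = len(y)

def loss(b):
    r = X @ b - y
    return 0.5 * (r @ r) / n

def grad(b):
    return X.T @ (X @ b - y) / n

def gd_backtracking(n_iter=5000, t0=1.0, shrink=0.5, c=1e-4):
    """Gradient descent with Armijo backtracking line search."""
    b = np.zeros(X.shape[1])
    history = [loss(b)]
    for _ in range(n_iter):
        g = grad(b)
        t = t0
        # Shrink the step until the sufficient-decrease condition holds.
        while loss(b - t * g) > history[-1] - c * t * (g @ g):
            t *= shrink
        b = b - t * g
        history.append(loss(b))
    return b, np.array(history)

def nesterov(n_iter=5000):
    """Nesterov accelerated gradient with step 1/L, L = largest Hessian eigenvalue."""
    L = np.linalg.eigvalsh(X.T @ X / n).max()
    b = v = np.zeros(X.shape[1])
    for k in range(1, n_iter + 1):
        b_new = v - grad(v) / L                      # gradient step at the look-ahead point
        v = b_new + (k - 1) / (k + 2) * (b_new - b)  # momentum extrapolation
        b = b_new
    return b

b_opt, *_ = np.linalg.lstsq(X, y, rcond=None)  # reference least-squares solution
b_bt, hist = gd_backtracking()
b_nag = nesterov()
print("least squares loss:", loss(b_opt))
print("backtracking GD   :", hist[-1])
print("Nesterov          :", loss(b_nag))
```

Plotting `hist` against the iteration number gives the loss curve mentioned above. Backtracking guarantees that the loss never increases, while the Nesterov iterates are not monotone but typically make faster progress on ill-conditioned problems.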

Related Solutions

Solved – sklearn Linear Regression vs Batch Gradient Descent

There are some problems in your question.

Can you make sure the iterative solver converges?

Note that we can solve linear regression (minimizing squared loss) in different ways. In my experience with Python's scikit-learn, the default settings often do not give a converged result. It is possible that the iterative solver is limited in its number of iterations and stopped early. If it stops early, the work is only half done, so the result will not be the same as the optimal solution you got from other algorithms.

I would not agree with:

LinearRegression is not good if the data set is large, in which case stochastic gradient descent needs to be used.

If we use QR decomposition, then even when the data is on the order of millions of rows (hopefully this is large enough), as long as the number of features is not big, we can solve it in a second. Check this R code. You may be surprised that we can solve a linear regression on a million data points in less than 1 second.

x=matrix(runif(2e6),ncol=2)
y=runif(1e6)
stime = proc.time()
lm(y~x)
print(proc.time()-stime)
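
A rough Python counterpart of this experiment, written as a sketch with NumPy's reduced QR factorization (the seed and variable names are assumptions, and the timing will depend on the machine):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = np.column_stack([np.ones(n), rng.uniform(size=(n, 2))])  # intercept + 2 features
y = rng.uniform(size=n)

start = time.perf_counter()
Q, R = np.linalg.qr(X)              # reduced QR: Q is n x 3, R is 3 x 3
beta = np.linalg.solve(R, Q.T @ y)  # solve R beta = Q^T y
elapsed = time.perf_counter() - start
print(beta, f"{elapsed:.3f} s")
```

Solving $R\beta = Q^Ty$ avoids forming $X^TX$, whose condition number is the square of that of $X$.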

Which approaches exist for optimization in machine learning

Towards Data Science isn't a reliable website, and the text you've quoted is, unfortunately, nonsense.

For any Optimization problem with respect to Machine Learning, there can be either a numerical approach or an analytical approach. The numerical problems are Deterministic, meaning that they have a closed form solution which doesn’t change. [...] These closed form solutions are solvable analytically. But these are not optimization problems.

What they meant to say, I hope, is that "analytical problems are Deterministic [...]", etc.

I won't explain the difference between analytic and numeric approaches here, because there are lots of good sources, but going by this paragraph I'm going to say the post you read isn't one of them.

EDIT: OK, I'll explain a bit

Part of the problem is that there are a lot of partially overlapping terms. Very roughly speaking, you have:

Models where you can directly calculate the parameters: AKA closed-form solutions, analytical or analytic solutions, or sometimes algebraic solutions.
Models where you have to use an iterative algorithm to fit the parameters. All such models are numerical, but
- They might be deterministic (no randomness), like batch gradient descent with fixed starting points, or stochastic (random), like stochastic gradient descent.
- They might always reach the best value (convex optimisation), or might have a risk of getting stuck at local optima (non-convex optimisation)
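
As a small illustration of the deterministic versus stochastic distinction, here is a sketch on made-up data (the data, step sizes, and function names are all assumptions for the example). Batch gradient descent from a fixed starting point returns the same answer on every run, while stochastic gradient descent depends on the random draws.

```python
import numpy as np

data_rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), data_rng.uniform(size=200)])
y = X @ np.array([1.0, 2.0]) + 0.1 * data_rng.normal(size=200)

def batch_gd(n_iter=500, step=0.5):
    b = np.zeros(2)  # fixed starting point
    for _ in range(n_iter):
        b -= step * X.T @ (X @ b - y) / len(y)  # gradient over the full data set
    return b

def sgd(seed, n_iter=500, step=0.05):
    rng = np.random.default_rng(seed)
    b = np.zeros(2)
    for _ in range(n_iter):
        i = rng.integers(len(y))  # one randomly chosen observation
        b -= step * (X[i] @ b - y[i]) * X[i]
    return b

print(batch_gd(), batch_gd())  # identical
print(sgd(seed=0), sgd(seed=1))  # differ from run to run
```

Both are fitting the same convex least-squares problem here, so the randomness affects the path and the final iterate, not the location of the optimum.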

There are plenty of other ways to slice this up, but these should be plenty to get started!