Solved – Extremely poor polynomial fitting with SVR in sklearn

nonlinear regression, regression, ridge regression, scikit learn

I am trying to fit data that clearly follows a polynomial of roughly degree 5.
Much to my despair, sklearn bluntly refuses to fit the polynomial and instead outputs what looks like a degree-0 (constant) function.

Here is the code. All you need to know is that sp_tr is an m×n matrix of n features, and that I take the first column (i_x) as my input data and the second one (i_y) as my output data.

    import numpy
    from sklearn.svm import SVR

    # Build an evenly spaced grid of 100 points spanning the input column.
    x_min = sp_tr[:,i_x].min()
    x_max = sp_tr[:,i_x].max()
    xs = numpy.arange( x_min, x_max, (x_max - x_min)/100 )

    # Fit the SVR on the single input column and predict over the grid.
    sp_clf = SVR( degree=5 )
    sp_clf.fit( sp_tr[:,[i_x]], sp_tr[:,i_y] )
    ys = sp_clf.predict( numpy.transpose([xs]) )

Then I plot xs, ys as a red line, and I plot the training data as blue dots (sp_tr[:,i_x] against sp_tr[:,i_y]). Here are the results I obtain, first with the kernel ridge method, second with SVR.

Polynomial fitting with kernel ridge
Polynomial fitting with SVR
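For reference, the exact plotting code is not part of the question; a minimal sketch of how such a plot could be produced with matplotlib, assuming the variables defined above, might look like this:

    import matplotlib.pyplot as plt

    # Blue dots: the raw data; red line: the fitted model's predictions over xs.
    plt.scatter(sp_tr[:,i_x], sp_tr[:,i_y], color='blue')
    plt.plot(xs, ys, color='red')
    plt.show()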

What happened? How could SVR and kernel ridge go so wrong as to conclude that the relation is constant? What can be done to get something more satisfying?
Thanks for any help.

Best Answer

In short, you need to tune your parameters. Here is what the sklearn docs say:

The free parameters in the model are C and epsilon.

and their descriptions:

C : float, optional (default=1.0)

Penalty parameter C of the error term.

epsilon : float, optional (default=0.1)

Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
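To make the epsilon-tube concrete, here is a rough sketch of the epsilon-insensitive loss that SVR applies to each training point (a simplified illustration, not sklearn's actual implementation):

    import numpy as np

    def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
        # Residuals smaller than epsilon cost nothing; larger ones are
        # penalized linearly, and C scales how heavily that penalty counts.
        return np.maximum(np.abs(y_true - y_pred) - epsilon, 0.0)

    print(epsilon_insensitive_loss(np.array([1.0, 1.0]), np.array([1.05, 2.0])))
    # [0.   0.9]  -- the 0.05 residual falls inside the tube, the 1.0 residual does not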

It looks like you have an under-penalized model: it is not punished harshly enough for straying away from the data. Let's check.

I generated some polynomial data that is on approximately the same scale as yours:

    import numpy as np
    import matplotlib.pyplot as plt

    xs = np.linspace(0, 1, 100)
    ys = 400*(xs - 2*xs*xs + xs*xs*xs) - 20
    plt.scatter(xs, ys, alpha=.25)

CubicSVR

And then fit the SVR with the default parameters:

    from sklearn.svm import SVR

    # Default parameters: C=1.0, epsilon=0.1.
    clf = SVR(degree=3)
    clf.fit(np.transpose([xs]), ys)
    yf = clf.predict(np.transpose([xs]))

Which gives me essentially the same issue as you:

UnderPenalizedSVR

Using the intuition that the parameters are under-penalizing the fit, I adjusted them:

    clf = SVR(degree=3, C=100, epsilon=.01)

Which gives me a pretty good fit:

GoodFitSVM

In general, whenever your model has free parameters like this, it is very important to tune them carefully. sklearn makes this as convenient as possible: it supplies the grid_search module (home of GridSearchCV), which will fit many models in parallel with different tuning parameters and choose the one that best fits your data. It is also important to measure the quality of fit correctly, since the fit measured on the training data is not a good indication of the fit on unseen data. Use cross validation or a held-out sample to examine how well your model generalizes. In your case, I would recommend cross validation with GridSearchCV.
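As a minimal sketch of what that could look like here, assuming the synthetic xs/ys from above (in recent sklearn versions GridSearchCV lives in sklearn.model_selection; older releases exposed it from sklearn.grid_search), with an illustrative rather than prescriptive grid of candidate values:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    xs = np.linspace(0, 1, 100)
    ys = 400*(xs - 2*xs*xs + xs*xs*xs) - 20

    # Candidate values for the two free parameters.
    param_grid = {
        "C": [0.1, 1, 10, 100, 1000],
        "epsilon": [0.001, 0.01, 0.1, 1],
    }

    # 5-fold cross validation over every (C, epsilon) combination.
    search = GridSearchCV(SVR(degree=3), param_grid, cv=5)
    search.fit(np.transpose([xs]), ys)

    print(search.best_params_)
    yf = search.best_estimator_.predict(np.transpose([xs]))

That way the held-out folds, rather than the training fit, decide which (C, epsilon) pair wins.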