I try to fit an obvious around degree 5 polynomial function.
Much to my despair, sklearn
bluntly refuses to match the polynomial, and instead output a 0-degree like function.
Here is the code. All you need to know is that sp_tr
is a m
×n
matrix of n
features and that I take the first column (i_x
) as my input data and the second one (i_y
) as my output data.
x_min = sp_tr[:,i_x].min()
x_max = sp_tr[:,i_x].max()
xs = numpy.arange( x_min, x_max, (x_max - x_min)/100 )
sp_clf = SVR( degree=5 )
sp_clf.fit( sp_tr[:,[i_x]], sp_tr[:,i_y] )
ys = sp_clf.predict( numpy.transpose([xs]) )
Then I plot xs, ys as a red line, and I plot the data learned from a blue dots (sp_tr[:,i_x]
to sp_tr[:,i_y]
). Here are the result I obtain, first with the kernel ridge method, second with SVR.
What happened ? How could SVR and Kernel Ridge go so wrong to believe the relation is constant ? What could be done to have something more satisfying ?
Thanks for any help.
Best Answer
In short, you need to tune your parameters. Here's the sklearn docs:
and their descriptions:
It looks like you have an under-penalized model, it is not punished harshly enough for straying away from the data. Let's check.
I generated some polynomial data that is on approximately the same scale as yours:
And then fit the SVR with the default parameters:
Which gives me essentially the same issue as you:
Using the intuition that the parameters are under-penalizing the fit, I adjusted them:
Which gives me a pretty good fit:
In general, whenever your model has free parameters like this, it is very important to tune them carefully.
sklearn
makes this as convenient as possible, it supplies the grid_search module, which will try many models in parallel with different tuning parameters and choose the one that best fits your data. Also important is getting the measurement of best fits your data correct, as the model fit measured using the training data is not a good representation of the model fit on unseen data. Use cross validation or a sample of held out data to examine how well your model fits. In your case, I would recommend using cross validation withGridSearchCV
.