I'm having some trouble getting reasonable results from a polynomial regression. When I used linear regression I got a reasonable error, but when I switched to polynomial regression the error jumped by almost a factor of seven. It could just be a bad model, but I suspect I'm doing something wrong. Here's the code I'm using (I've left off a little at the beginning that gets the data and does some processing):
import numpy
import pandas
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Everything except SalePrice is a feature
feature_columns = list(train)
feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']

# Expand the features with all degree-2 polynomial terms
poly = PolynomialFeatures(degree=2)
polyX = poly.fit_transform(X)
polyTest = poly.fit_transform(test[feature_columns])

# Fit ordinary least squares on the expanded features
lm = LinearRegression()
lm.fit(polyX, y)

# Predict, take absolute values, round, and write the submission file
predictions = numpy.absolute(lm.predict(polyTest).round(decimals=2))
pandas.DataFrame({'Id': pandas.to_numeric(test.Id, downcast='integer'),
                  'SalePrice': predictions}).to_csv('2017-07-02-2.csv', index=False)
I'm not sure if it matters, but more than 250 features are being processed. Another issue, which might be related, is that if I try to use a degree larger than 2, it takes too long to run. What am I doing wrong here?
Best Answer
Here's how I go about it in pure sklearn. There are probably ways to improve this workflow.
First I made a transformer that simply selects one column from a DataFrame or matrix:
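A minimal sketch of such a transformer (the original listing is not shown, so the name ColumnSelector and the exact DataFrame/array handling here are assumptions):

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    # Select a single column from a pandas DataFrame (by name)
    # or a numpy matrix (by position).
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn: selection is stateless.
        return self

    def transform(self, X):
        if hasattr(X, 'loc'):  # pandas DataFrame
            return X[[self.column]].values
        return X[:, [self.column]]  # numpy array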
Then I made a PolynomialExpansion class:
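Again, a minimal sketch, assuming the class just expands one column into successive powers:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PolynomialExpansion(BaseEstimator, TransformerMixin):
    # Expand a single column into the powers 1, 2, ..., degree.
    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Reshape to a column vector, then stack the powers side by side.
        X = np.asarray(X, dtype=float).reshape(-1, 1)
        return np.hstack([X ** d for d in range(1, self.degree + 1)])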
Then, to use these, I combine them with Pipelines and FeatureUnions. This pipeline uses the arsenic data from Gelman and Hill.
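A sketch of how the pieces fit together, assuming the arsenic data sits in a DataFrame named wells with predictor columns 'dist' and 'arsenic' and a binary target 'switch' (these names, and the choice of LogisticRegression for the binary outcome, are assumptions):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression

# Expand 'dist' and 'arsenic' separately, then concatenate the columns.
model = Pipeline([
    ('features', FeatureUnion([
        ('dist', Pipeline([
            ('select', ColumnSelector('dist')),
            ('poly', PolynomialExpansion(degree=3)),
        ])),
        ('arsenic', Pipeline([
            ('select', ColumnSelector('arsenic')),
            ('poly', PolynomialExpansion(degree=2)),
        ])),
    ])),
    ('model', LogisticRegression()),
])

model.fit(wells[['dist', 'arsenic']], wells['switch'])

The nice part of this setup is that each feature gets its own expansion degree, instead of taking every pairwise product of 250 columns at once.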
Another option is to use patsy, which allows you to specify model formulas, as in R.
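For example, a degree-2 expansion of two columns can be written directly as a formula (the toy frame df here is just for illustration):

import pandas
from patsy import dmatrix

df = pandas.DataFrame({'x1': [1.0, 2.0, 3.0], 'x2': [4.0, 5.0, 6.0]})
# I(...) evaluates the arithmetic inside; x1:x2 adds the interaction term.
design = dmatrix('x1 + x2 + I(x1**2) + I(x2**2) + x1:x2', df, return_type='dataframe')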