Solved – Correct way to use polynomial regression in Python

nonlinear-regression, polynomial, python, regression

I'm having trouble getting reasonable results from a polynomial regression. With linear regression I got a reasonable error, but when I switched to polynomial regression the error jumped by almost a factor of seven. It could just be a bad model, but I suspect I'm doing something wrong. Here's the code I'm using (I've left off the beginning, which loads the data and does some processing):

import numpy
import pandas
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

feature_columns = list(train)
feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']

poly = PolynomialFeatures(degree=2)
polyX = poly.fit_transform(X)
polyTest = poly.fit_transform(test[feature_columns])

lm = LinearRegression()
lm.fit(polyX, y)

predictions = numpy.absolute(lm.predict(polyTest).round(decimals=2))

pandas.DataFrame({'Id': pandas.to_numeric(test.Id, downcast='integer'), 'SalePrice': predictions}).to_csv('2017-07-02-2.csv', index=False)

I'm not sure if it matters, but there are more than 250 features being processed. Another issue, which might be related, is that if I try to use a degree larger than 2 it takes too long to run. What am I doing wrong here?
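
For a sense of scale, here's a rough check (not part of my actual script) of how many columns PolynomialFeatures generates from 250 input features, which might explain why higher degrees take so long:

import math

n_features = 250
for degree in (2, 3):
    # With include_bias=True, PolynomialFeatures emits one column per monomial
    # of total degree <= degree, i.e. C(n_features + degree, degree) columns.
    print(degree, math.comb(n_features + degree, degree))

# Prints:
# 2 31626
# 3 2667126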

Best Answer

Here's how I go about it in pure sklearn. There are probably ways to improve this workflow.

First I made a transformer that simply selects one or more columns from a DataFrame or matrix:

import numpy as np
import pandas as pd


class ColumnSelector(object):

    def __init__(self, idxs):
        self.idxs = np.asarray(idxs)

    # Fit here doesn't need to do anything.  We already know the indices of the columns
    # we want to keep.
    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, **transform_params):
        # Need to treat pandas data frames and numpy arrays slightly differently.
        if isinstance(X, pd.DataFrame):
            return X.iloc[:, self.idxs]
        return X[:, self.idxs]
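
As a quick sanity check on a made-up two-column DataFrame (the column names are just placeholders), the selector pulls out whichever columns you ask for:

toy = pd.DataFrame({'arsenic': [0.5, 1.2, 2.3], 'dist': [10.0, 25.0, 40.0]})
ColumnSelector([0]).fit(toy).transform(toy)   # returns just the 'arsenic' column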

Then I made a PolynomialExpansion class:

class PolynomialExpansion(object):

    def __init__(self, degree):
        self.degree = degree

    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, **transform_params):
        # Initialize our return value as a matrix of all zeros.
        # We are going to overwrite all of these zeros in the code below.
        X_poly = np.zeros((X.shape[0], self.degree))
        # The first column in our transformed matrix is just the vector we started with.
        X_poly[:, 0] = X.squeeze()
        # Cleverness Alert:
        # We create the subsequent columns by multiplying the most recently created column
        # by X.  This creates the sequence X -> X^2 -> X^3 -> etc...
        for i in range(1, self.degree):
            X_poly[:, i] = X_poly[:, i-1] * X.squeeze()
        return X_poly
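
To see what the transform produces, here's a tiny made-up input; a degree-2 expansion returns the original column and its square side by side:

x = np.array([[1.0], [2.0], [3.0]])
PolynomialExpansion(2).fit(x).transform(x)
# array([[1., 1.],
#        [2., 4.],
#        [3., 9.]])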

Then, to use these, I combine them with Pipelines and FeatureUnions. This pipeline uses the arsenic well data from Gelman and Hill:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression

wells_pipeline = Pipeline([
    ('polynomial_expansions', FeatureUnion([
        ('arsenic_quadratic', Pipeline([
            ('arsenic_selector', ColumnSelector([0])),
            ('quadratic_expansion', PolynomialExpansion(2))
        ])),
        ('distance_quadratic', Pipeline([
            ('distance_selector', ColumnSelector([1])),
            ('quadratic_expansion', PolynomialExpansion(2))
        ]))
    ])),
    ('regression', LogisticRegression())
])
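
Fitting and predicting then work like any other sklearn estimator. Here's a rough usage sketch; the wells DataFrame and its arsenic, dist, and switch columns are assumptions based on the Gelman and Hill data, not something defined above:

# 'wells' is assumed to have arsenic and dist features and a 0/1 'switch' target.
X = wells[['arsenic', 'dist']]   # order matters: ColumnSelector([0]) is arsenic, [1] is dist
y = wells['switch']

wells_pipeline.fit(X, y)
switch_probabilities = wells_pipeline.predict_proba(X)[:, 1]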

Another option is to use patsy, which allows you to specify model formulas, as in R.
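
As a rough sketch (the formula and the wells column names are assumptions, not something taken from the code above), patsy builds the quadratic design matrix directly from an R-style formula:

import patsy
from sklearn.linear_model import LogisticRegression

# I(...) protects the arithmetic so it is evaluated instead of parsed as formula syntax.
y, X = patsy.dmatrices(
    'switch ~ arsenic + I(arsenic ** 2) + dist + I(dist ** 2)',
    data=wells,
    return_type='dataframe',
)

# patsy already adds an Intercept column, so turn off sklearn's own intercept.
model = LogisticRegression(fit_intercept=False).fit(X, y.values.ravel())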