Solved – Polynomial regression seems to give different coefficients in Python and in R

Tags: polynomial, python, r, regression

When I fit a polynomial regression to the Boston data set in R, I get different coefficients than when I do the same in Python. Example code in R:

library(MASS)  # for the Boston data set

# Cubic polynomial in dis (data = Boston makes attach() unnecessary)
lm.fit <- lm(nox ~ poly(dis, 3), data = Boston)
summary(lm.fit)

This yields the coefficients

Coefficients:
               Estimate
(Intercept)    0.554695
poly(dis, 3)1 -2.003096
poly(dis, 3)2  0.856330
poly(dis, 3)3 -0.318049

With Python:

import pandas as pd
import statsmodels.api as sm
from sklearn import datasets
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Note: load_boston() was removed in scikit-learn 1.2, so this needs an
# older scikit-learn version (or another source for the Boston data)
boston = datasets.load_boston()
df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
df_x = df[['DIS']]
df_y = df[['NOX']]

# Raw polynomial basis: columns 1, x, x^2, x^3
poly = PolynomialFeatures(3)
df_x_transform = poly.fit_transform(df_x)

lin_regressor = LinearRegression()
lin_regressor.fit(df_x_transform, df_y)
print(lin_regressor.intercept_, lin_regressor.coef_)

Yields:

0.9341280720211879 [ 0.         -0.18208169  0.02192766 -0.000885  ]
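(Incidentally, the leading 0 in coef_ is just the redundant bias column: PolynomialFeatures(3) includes a constant column by default, and LinearRegression fits its own intercept on top of it, so the coefficient on that column ends up at zero. A minimal sketch that avoids the duplication, using PolynomialFeatures' real include_bias parameter and illustrative variable names:

# Sketch: drop PolynomialFeatures' bias column so LinearRegression's
# own intercept is the only constant term (variable names are illustrative)
poly_no_bias = PolynomialFeatures(3, include_bias=False)
X_no_bias = poly_no_bias.fit_transform(df_x)   # columns x, x^2, x^3 only
reg = LinearRegression().fit(X_no_bias, df_y)
print(reg.intercept_, reg.coef_)               # no leading zero coefficient
)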

And with statsmodels:

# df_x_transform already contains the bias column from PolynomialFeatures,
# so there is no need for sm.add_constant() here
model = sm.OLS(df_y, df_x_transform)
fitted = model.fit()
print(fitted.summary())

We get the same result as with sklearn, as expected, since both solve the same least-squares problem on the same design matrix:

                 coef
const          0.9341
x1            -0.1821
x2             0.0219
x3            -0.0009

How is this possible?

Best Answer

R's poly() generates an orthogonal polynomial basis by default, while scikit-learn's PolynomialFeatures generates the raw powers 1, x, x², x³. The two bases span the same column space, so both models produce identical fitted values; only the coefficients, which are basis-dependent, differ. Try passing raw = TRUE to poly(), i.e. lm(nox ~ poly(dis, 3, raw = TRUE), data = Boston); you should then recover the Python coefficients.
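
To see concretely that an orthogonal basis and the raw basis describe the same fit with different coefficients, here is a minimal self-contained Python sketch. The data are synthetic (standing in for Boston), and the QR step is only in the spirit of R's poly(), which additionally centers x and fixes a particular scaling, so the coefficients below will not numerically match R's output:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)                  # synthetic stand-in for the Boston data
x = np.sort(rng.uniform(1.0, 12.0, size=200))   # roughly the range of DIS
y = 0.55 - 0.04 * x + 0.002 * x**2 + rng.normal(scale=0.02, size=200)

# Raw basis, as PolynomialFeatures(3) builds it: columns 1, x, x^2, x^3
X_raw = np.vander(x, N=4, increasing=True)

# Orthogonalize the raw columns with a QR decomposition, in the spirit
# of R's poly(): same column space, mutually orthogonal columns
Q, _ = np.linalg.qr(X_raw)

fit_raw = sm.OLS(y, X_raw).fit()
fit_orth = sm.OLS(y, Q).fit()

print(fit_raw.params)    # one set of coefficients
print(fit_orth.params)   # a completely different set
print(np.allclose(fit_raw.fittedvalues, fit_orth.fittedvalues))  # True: same fitted curve

Since the two design matrices span the same column space, the least-squares projection of y, and hence every fitted value and prediction, is identical; only the coordinates (the coefficients) in each basis differ.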
