Solved – Different output from R lm() and Python statsmodels OLS for linear regression

least squares, lm, python, r, statsmodels

I'm exploring linear regression in R and Python. I usually get the same results from both, but this is an instance where I do not.

I added the sum of Agriculture and Education to the swiss dataset as an additional explanatory variable z, with Fertility as the response.

R gives me an NA for the $\beta$ coefficient of z, but Python gives me a numeric value for z along with a warning about a very small eigenvalue. Is there a way to make Python and statsmodels explicitly tell me that z adds no information to the regression?

Additionally, I originally did this analysis in an IPython notebook, where there is no need to explicitly print the regression summary reg_results. When the print call is omitted, the warning about the small eigenvalue does not appear, which makes it harder to tell that z is worthless.

R code:

data(swiss)
# add a column that is an exact linear combination of two existing predictors
swiss$z <- swiss$Agriculture + swiss$Education
formula <- 'Fertility ~ .'
print(lm(formula, data=swiss))

R output:

Call:
lm(formula = formula, data = swiss)

Coefficients:
     (Intercept)       Agriculture       Examination         Education
         66.9152           -0.1721           -0.2580           -0.8709
        Catholic  Infant.Mortality                 z
          0.1041            1.0770                NA

Python Code:

import statsmodels.formula.api as sm
import pandas.rpy.common as com

# load R's built-in swiss dataset via pandas' R interface
swiss = com.load_data('swiss')

# get rid of periods in column names
swiss.columns = [c.replace('.', '_') for c in swiss.columns]

# add clearly duplicative data
swiss['z'] = swiss['Agriculture'] + swiss['Education']

y = 'Fertility'
x = "+".join(swiss.columns - [y])
formula = '%s ~ %s' % (y, x)
reg_results = sm.ols(formula, data=swiss).fit().summary()
print(reg_results)

Python output:

                            OLS Regression Results
==============================================================================
Dep. Variable:              Fertility   R-squared:                       0.707
Model:                            OLS   Adj. R-squared:                  0.671
Method:                 Least Squares   F-statistic:                     19.76
Date:                Thu, 25 Sep 2014   Prob (F-statistic):           5.59e-10
Time:                        22:55:42   Log-Likelihood:                -156.04
No. Observations:                  47   AIC:                             324.1
Df Residuals:                      41   BIC:                             335.2
Df Model:                           5
====================================================================================
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept           66.9152     10.706      6.250      0.000        45.294    88.536
Agriculture          0.1756      0.062      2.852      0.007         0.051     0.300
Catholic             0.1041      0.035      2.953      0.005         0.033     0.175
Education           -0.5233      0.115     -4.536      0.000        -0.756    -0.290
Examination         -0.2580      0.254     -1.016      0.315        -0.771     0.255
Infant_Mortality     1.0770      0.382      2.822      0.007         0.306     1.848
z                   -0.3477      0.073     -4.760      0.000        -0.495    -0.200
==============================================================================
Omnibus:                        0.058   Durbin-Watson:                   1.454
Prob(Omnibus):                  0.971   Jarque-Bera (JB):                0.155
Skew:                          -0.077   Prob(JB):                        0.925
Kurtosis:                       2.764   Cond. No.                     1.11e+08
==============================================================================

Warnings:
[1] The smallest eigenvalue is 3.87e-11. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.


Best Answer

OLS in statsmodels currently has no option to drop singular columns.

statsmodels OLS uses the Moore-Penrose generalized inverse, pinv, to solve the linear least squares problem, which yields the minimum-norm solution even when the design matrix is singular. As a consequence, the reported covariance matrix has reduced rank. However, we can use estimable contrasts to obtain and test the effects for which the covariance is of full rank.
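For instance, z = Agriculture + Education by construction, so combinations such as Agriculture + z and Education + z lie in the row space of the design matrix and are estimable; they reproduce the Agriculture and Education coefficients that R reports. A minimal sketch using the model from the question, keeping the fitted results object instead of only its summary:

fit = sm.ols(formula, data=swiss).fit()

# estimable contrasts recover the full-rank effects
print(fit.t_test('Agriculture + z = 0'))  # estimate ~ -0.1721, R's Agriculture coefficient
print(fit.t_test('Education + z = 0'))    # estimate ~ -0.8709, R's Education coefficient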

If a user wants to have a full rank solution with statsmodels, the user has to decide which of the collinear columns to drop.
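In this example, that just means leaving z out of the formula, e.g.:

# drop the redundant column and refit; the coefficients match R's lm()
x = "+".join(swiss.columns.difference([y, 'z']))
print(sm.ols('%s ~ %s' % (y, x), data=swiss).fit().summary())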

One way to find collinear columns is described here: https://stackoverflow.com/questions/13312498/how-to-find-degenerate-rows-columns-in-a-covariance-matrix
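As a rough sketch of that idea: the eigenvectors belonging to near-zero eigenvalues of X'X span the null space of the design matrix, and their large loadings mark the columns involved in the dependency. The tolerance below is illustrative, not part of any statsmodels API:

import numpy as np

model = sm.ols(formula, data=swiss)
exog = model.exog                                # design matrix, intercept included
eigvals, eigvecs = np.linalg.eigh(np.dot(exog.T, exog))

tol = eigvals.max() * 1e-12
for val, vec in zip(eigvals, eigvecs.T):
    if val < tol:
        # columns with large loadings form the (near-)exact dependency
        print(val, dict(zip(model.exog_names, np.round(vec, 3))))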

(Most likely an option to automatically drop collinear columns will be available in a future version of statsmodels.)
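Regarding the notebook issue in the question: the quantities behind the printed warning are also available on the fitted results object, so the check can be automated without printing the summary. A sketch, assuming the results object exposes condition_number and eigenvals (sorted in decreasing order), as recent statsmodels versions do:

fit = sm.ols(formula, data=swiss).fit()
print(fit.condition_number)   # ~1.1e+08 here
print(fit.eigenvals[-1])      # ~3.9e-11, the value the summary warning reports

# flag a (near-)singular design programmatically
if fit.eigenvals[-1] < fit.eigenvals[0] * 1e-12:
    print('design matrix is close to singular; some columns are collinear')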