I'm exploring linear regression in R and Python, and I usually get the same results, but here is a case where I don't. I added the sum of `Agriculture` and `Education` to the `swiss` dataset as an additional explanatory variable `z`, with `Fertility` as the response. R gives me an `NA` for the $\beta$ value of `z`, but Python gives me a numeric value for `z` and a warning about a very small eigenvalue. Is there a way to make Python and statsmodels explicitly tell me that `z` adds no information to the regression?

Additionally, I originally did this analysis in an IPython notebook, where there is no need for an explicit `print` of the regression summary `reg_results`; when the `print` command is omitted there is no warning about the low eigenvalue, which makes it harder to notice that `z` is worthless.
R code:

```r
data(swiss)
swiss$z <- swiss$Agriculture + swiss$Education
formula <- 'Fertility ~ .'
print(lm(formula, data=swiss))
```
R output:

```
Call:
lm(formula = formula, data = swiss)

Coefficients:
     (Intercept)       Agriculture       Examination         Education
         66.9152           -0.1721           -0.2580           -0.8709
        Catholic  Infant.Mortality                 z
          0.1041            1.0770                NA
```
Python code:

```python
import statsmodels.formula.api as sm
import pandas.rpy.common as com

swiss = com.load_data('swiss')
# get rid of periods in column names
swiss.columns = [_.replace('.', '_') for _ in swiss.columns]
# add clearly duplicative data
swiss['z'] = swiss['Agriculture'] + swiss['Education']
y = 'Fertility'
x = "+".join(swiss.columns - [y])
formula = '%s ~ %s' % (y, x)
reg_results = sm.ols(formula, data=swiss).fit().summary()
print(reg_results)
```
Python output:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:              Fertility   R-squared:                       0.707
Model:                            OLS   Adj. R-squared:                  0.671
Method:                 Least Squares   F-statistic:                     19.76
Date:                Thu, 25 Sep 2014   Prob (F-statistic):           5.59e-10
Time:                        22:55:42   Log-Likelihood:                -156.04
No. Observations:                  47   AIC:                             324.1
Df Residuals:                      41   BIC:                             335.2
Df Model:                           5
====================================================================================
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept           66.9152     10.706      6.250      0.000        45.294    88.536
Agriculture          0.1756      0.062      2.852      0.007         0.051     0.300
Catholic             0.1041      0.035      2.953      0.005         0.033     0.175
Education           -0.5233      0.115     -4.536      0.000        -0.756    -0.290
Examination         -0.2580      0.254     -1.016      0.315        -0.771     0.255
Infant_Mortality     1.0770      0.382      2.822      0.007         0.306     1.848
z                   -0.3477      0.073     -4.760      0.000        -0.495    -0.200
==============================================================================
Omnibus:                        0.058   Durbin-Watson:                   1.454
Prob(Omnibus):                  0.971   Jarque-Bera (JB):                0.155
Skew:                          -0.077   Prob(JB):                        0.925
Kurtosis:                       2.764   Cond. No.                     1.11e+08
==============================================================================

Warnings:
[1] The smallest eigenvalue is 3.87e-11. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
```
Best Answer
OLS in statsmodels currently has no option to drop singular columns.

statsmodels OLS uses the Moore-Penrose generalized inverse, `pinv`, to solve the linear least squares problem. This means the reported covariance matrix has reduced rank. However, we can use estimable contrasts to obtain and test the effects for which the covariance is of full rank.
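To illustrate why a coefficient is still reported for every column (a minimal numpy sketch of the idea, not statsmodels internals verbatim): `pinv` returns the minimum-norm solution even when the design matrix is singular, so every column, including an exactly collinear one, gets a numeric coefficient, while the fitted values remain fine.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 50
x1 = rng.rand(n)
x2 = rng.rand(n)
z = x1 + x2                      # exactly collinear column, like swiss$z
X = np.column_stack([np.ones(n), x1, x2, z])
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 0.01 * rng.rand(n)

# X is rank deficient: 4 columns but rank 3
print(np.linalg.matrix_rank(X))

# pinv still yields a (minimum-norm) solution, so every column
# gets a numeric coefficient -- including z
beta = np.linalg.pinv(X).dot(y)
print(beta)

# the fit itself is unaffected; only the individual coefficients
# of the collinear columns are not identified
print(np.abs(y - X.dot(beta)).max())
```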
If you want a full-rank solution with statsmodels, you have to decide which of the collinear columns to drop. One way to find collinear columns is described here: https://stackoverflow.com/questions/13312498/how-to-find-degenerate-rows-columns-in-a-covariance-matrix
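As a self-contained sketch of that idea (my own illustration, with made-up column names): eigenvectors of $X^T X$ belonging to near-zero eigenvalues put nonzero weight exactly on the columns involved in the linear dependency, so they tell you which columns to consider dropping.

```python
import numpy as np

rng = np.random.RandomState(1)
n = 50
x1 = rng.rand(n)
x2 = rng.rand(n)
X = np.column_stack([x1, x2, rng.rand(n), x1 + x2])
names = ['x1', 'x2', 'x3', 'z']   # z duplicates x1 + x2

# eigen-decompose the cross-product matrix
eigvals, eigvecs = np.linalg.eigh(X.T.dot(X))

# eigenvalues close to zero signal a (near-)singular design;
# the matching eigenvector's weights point at the columns involved
tol = 1e-8 * eigvals.max()
for val, vec in zip(eigvals, eigvecs.T):
    if val < tol:
        involved = [nm for nm, w in zip(names, vec) if abs(w) > 1e-6]
        print('degenerate combination involves:', involved)
```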
(Most likely an option to automatically drop collinear columns will be available in a future version of statsmodels.)
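In the meantime, for the notebook situation in the question, where the summary's warning is easy to miss, you can run your own check on the design matrix before or after fitting. A hedged sketch with plain numpy (the `check_collinearity` helper and the `1e8` threshold are my own choices; the summary's warning is likewise based on the eigenvalues of $X^T X$):

```python
import numpy as np

def check_collinearity(X, cond_limit=1e8):
    """Return (ok, cond): ok is False when the design matrix X is
    (near-)singular, judged by a huge condition number, i.e. a tiny
    smallest eigenvalue of X'X relative to the largest."""
    eigvals = np.linalg.eigvalsh(X.T.dot(X))
    # guard against tiny negative eigenvalues from round-off
    smallest = max(eigvals.min(), np.finfo(float).tiny)
    cond = np.sqrt(eigvals.max() / smallest)
    return cond <= cond_limit, cond

rng = np.random.RandomState(2)
x1, x2 = rng.rand(40), rng.rand(40)
X_ok = np.column_stack([np.ones(40), x1, x2])
X_bad = np.column_stack([np.ones(40), x1, x2, x1 + x2])

print(check_collinearity(X_ok))
print(check_collinearity(X_bad))
```

Calling this on the model's exog array after building the formula would raise the flag even when the summary is never printed.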