Regression – Are Two Linear Regression Models Significantly Different? A Python Approach

hypothesis-testing, python, regression

This question extends "What test should be used to tell if two linear regression lines are significantly different?" to the more general case of comparing two estimated models.

I have got the following two data series. Are the two corresponding linear regression models significantly different?

Since I barely know R, I would be very happy to learn about Python code to answer this. (Python has, for example, mlpy.ols_base or sklearn.linear_model.LinearRegression to compute the models; see the fitting sketch after the data below.)

If you answer with an R implementation instead, please provide the full code.

Series 1:

x   y
3.7117  0.0033
13.3551 0.1259
18.1202 0.1978
23.0639 0.2701
27.752  0.327

Series 2:

x   y
7.5829  0.0521
12.2515 0.1165
5.2919  0.0231
17.1492 0.1918
10.0384 0.0916
3.3088  0.0012
21.8032 0.2358
14.6613 0.1477
7.5773  0.0657
1.4326  -0.0366
8.1549  0.0651
8.9286  0.0684
16.8413 0.1687
17.9991 0.1849
1.5386  -0.0366
8.3319  0.0561
8.9153  0.0667
11.5032 0.0968
16.8197 0.1683
18.0486 0.1844
2.1863  -0.0073
9.1413  0.0787
8.9726  0.0674
12.0396 0.1044
16.8161 0.1699
18.3706 0.1864
3.0798  -0.0078
10.1183 0.0867
9.1358  0.0682
12.7242 0.1118
16.8679 0.1661
18.789  0.2

LibreOffice models and visualization:

[figure: LibreOffice visualization of the two series with their fitted linear models]
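For reference, a minimal sketch of entering the data and fitting each series separately with sklearn.linear_model.LinearRegression (assuming NumPy and scikit-learn are installed); the answer sketches below reuse the x1, y1, x2, y2 arrays defined here:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Series 1
    x1 = np.array([3.7117, 13.3551, 18.1202, 23.0639, 27.752])
    y1 = np.array([0.0033, 0.1259, 0.1978, 0.2701, 0.327])

    # Series 2
    x2 = np.array([7.5829, 12.2515, 5.2919, 17.1492, 10.0384, 3.3088, 21.8032,
                   14.6613, 7.5773, 1.4326, 8.1549, 8.9286, 16.8413, 17.9991,
                   1.5386, 8.3319, 8.9153, 11.5032, 16.8197, 18.0486, 2.1863,
                   9.1413, 8.9726, 12.0396, 16.8161, 18.3706, 3.0798, 10.1183,
                   9.1358, 12.7242, 16.8679, 18.789])
    y2 = np.array([0.0521, 0.1165, 0.0231, 0.1918, 0.0916, 0.0012, 0.2358,
                   0.1477, 0.0657, -0.0366, 0.0651, 0.0684, 0.1687, 0.1849,
                   -0.0366, 0.0561, 0.0667, 0.0968, 0.1683, 0.1844, -0.0073,
                   0.0787, 0.0674, 0.1044, 0.1699, 0.1864, -0.0078, 0.0867,
                   0.0682, 0.1118, 0.1661, 0.2])

    # Fit a separate simple linear regression to each series
    for name, x, y in [("Series 1", x1, y1), ("Series 2", x2, y2)]:
        model = LinearRegression().fit(x.reshape(-1, 1), y)
        print(name, "slope:", model.coef_[0], "intercept:", model.intercept_)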

Best Answer

If the variances about the lines are the same, Maarten's answer to the earlier question applies: stack the y's and x's, add an indicator for the series (0 for the first series, 1 for the second), and include the interaction between the stacked x and the series indicator. A partial F-test of whether the two terms involving the indicator are jointly zero is then a test that the regression lines are the same. You can also test the intercepts or the slopes individually via a t-test on the indicator term or the interaction term, respectively.
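A minimal sketch of this stacked (dummy-variable) approach with statsmodels, reusing the x1, y1, x2, y2 arrays from the question: it fits the reduced model (one common line) and the full model (separate intercept and slope per series) and compares them with a partial F-test via anova_lm.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # Stack the two series and add a 0/1 indicator g for the series
    df = pd.DataFrame({
        "x": np.concatenate([x1, x2]),
        "y": np.concatenate([y1, y2]),
        "g": np.concatenate([np.zeros(len(x1)), np.ones(len(x2))]),
    })

    reduced = smf.ols("y ~ x", data=df).fit()    # one common line
    full = smf.ols("y ~ x * g", data=df).fit()   # separate intercept and slope per series

    # Partial F-test: are the indicator and interaction terms jointly zero?
    print(anova_lm(reduced, full))

    # Individual t-tests: the 'g' coefficient tests equal intercepts,
    # the 'x:g' coefficient tests equal slopes
    print(full.summary())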

If you don't assume they have the same variance, but the two sets of data are independent and the samples are large, you could treat the difference $\mathbf{\hat\beta}_1-\mathbf{\hat\beta}_2$ as approximately $N(0,\Sigma_D)$ under the null, where $\Sigma_D=\Sigma_1+\Sigma_2$ is estimated by the sum of the two estimated variance-covariance matrices. This gives an approximate chi-squared (Wald) test of simultaneous equality of both coefficients.
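A sketch of this large-sample Wald test, again reusing x1, y1, x2, y2 from the question: each series is fitted separately, and the coefficient difference is combined with the sum of the two estimated covariance matrices into a chi-squared statistic with 2 degrees of freedom (intercept and slope).

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    fit1 = sm.OLS(y1, sm.add_constant(x1)).fit()
    fit2 = sm.OLS(y2, sm.add_constant(x2)).fit()

    d = fit1.params - fit2.params                 # difference of (intercept, slope)
    cov_d = fit1.cov_params() + fit2.cov_params() # estimate of Sigma_D

    # Wald statistic: d' * inv(Sigma_D) * d, compared to chi^2 with 2 df
    wald = float(d @ np.linalg.solve(cov_d, d))
    p_value = stats.chi2.sf(wald, df=2)
    print("Wald chi2 =", wald, "p =", p_value)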

If you want to test equality of just the slopes, you can do the univariate version of the above. Alternatively, if the samples are not large but you assume conditionally homoskedastic, normal errors for each response, you can use a Welch-Satterthwaite type t-test.
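And a sketch of the slope-only comparison as a Welch-Satterthwaite type t-test, built from the per-series slope estimates and their standard errors (again assuming the x1, y1, x2, y2 arrays from the question).

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    fit1 = sm.OLS(y1, sm.add_constant(x1)).fit()
    fit2 = sm.OLS(y2, sm.add_constant(x2)).fit()

    b1, b2 = fit1.params[1], fit2.params[1]   # slopes
    se1, se2 = fit1.bse[1], fit2.bse[1]       # their standard errors

    t_stat = (b1 - b2) / np.sqrt(se1**2 + se2**2)

    # Welch-Satterthwaite approximation to the degrees of freedom,
    # using the residual df (n - 2) of each separate fit
    df1, df2 = fit1.df_resid, fit2.df_resid
    nu = (se1**2 + se2**2) ** 2 / (se1**4 / df1 + se2**4 / df2)

    p_value = 2 * stats.t.sf(abs(t_stat), df=nu)
    print("t =", t_stat, "approx df =", nu, "p =", p_value)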
