Solved – Difference in R-squared observed from statsmodels when WLS is used

python, r-squared, regression, statsmodels, weighted-regression

Recently I have been fitting one of my problems with OLS and WLS, and was trying to determine whether a weighted regression would be more suitable by comparing the R^2 values of the two models.

I used statsmodels to produce the R^2 for both of the models and I also have another function which uses its own formula to calculate the R^2 of the model:

import numpy as np

# Manual R^2: unweighted residual and total sums of squares
y_pred = model.predict(X)
SS_Residual = np.sum((y - y_pred) ** 2)
SS_Total = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - SS_Residual / SS_Total

This works perfectly when the model is an OLS, but when the model is a WLS the result differs by a huge margin between what statsmodels produces and what my hard-coded function produces.

I would like to know whether statsmodels calculates R-squared differently for a WLS model, or whether there is something wrong with my approach. Thank you!

This is the OLS result, which has an R^2 of ~0.3, the same as what my function calculated.

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.306
Model:                            OLS   Adj. R-squared:                  0.298
Method:                 Least Squares   F-statistic:                     40.93
Date:                Mon, 26 Feb 2018   Prob (F-statistic):           6.30e-09
Time:                        14:27:34   Log-Likelihood:                 315.72
No. Observations:                  95   AIC:                            -627.4
Df Residuals:                      93   BIC:                            -622.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0007      0.001     -0.725      0.470      -0.002       0.001
p5             0.5629      0.088      6.397      0.000       0.388       0.738
==============================================================================
Omnibus:                        5.067   Durbin-Watson:                   2.182
Prob(Omnibus):                  0.079   Jarque-Bera (JB):                4.416
Skew:                           0.500   Prob(JB):                        0.110
Kurtosis:                       3.341   Cond. No.                         97.3
==============================================================================

However, when I use a WLS with weights, the R-squared reported is drastically increased to ~0.7, even though the coefficient in fact doesn't change much, and my function calculated an R^2 of ~0.3 for this WLS model instead.

                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.772
Model:                            WLS   Adj. R-squared:                  0.769
Method:                 Least Squares   F-statistic:                     314.5
Date:                Mon, 26 Feb 2018   Prob (F-statistic):           1.37e-31
Time:                        14:27:34   Log-Likelihood:                -14.763
No. Observations:                  95   AIC:                             33.53
Df Residuals:                      93   BIC:                             38.63
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0006      0.001     -1.170      0.245      -0.002       0.000
p5             0.6230      0.035     17.733      0.000       0.553       0.693
==============================================================================
Omnibus:                       27.432   Durbin-Watson:                   1.889
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              161.320
Skew:                           0.609   Prob(JB):                     9.33e-36
Kurtosis:                       9.267   Cond. No.                         63.8
==============================================================================

Best Answer

The packages calculate R-squared for weighted least squares regression differently from the way they do it for ordinary least squares regression, so your answer will not match the results produced by any of them. The key change is in how SS_Total is computed: instead of using the simple arithmetic mean of y, the packages use a weighted mean, with the same weights that were used to compute the WLS estimator. The residual sum of squares is likewise weighted.

The interpretation of the change in the formula is that the R^2 now tells you the proportion of total variation in the weighted Y that is explained by the weighted X, which is more intuitive for a weighted fit and hence is what most packages report.
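The formula above can be sketched in a few lines of numpy. This is my own illustrative helper (`weighted_r_squared` is not a statsmodels function); it follows the description above: weight both sums of squares and center SS_Total on the weighted mean of y. With uniform weights it reduces to the ordinary R^2.

```python
import numpy as np

def weighted_r_squared(y, y_pred, w):
    """R^2 in the weighted sense described above: both the residual and
    total sums of squares are weighted, and SS_Total is centered on the
    *weighted* mean of y rather than the plain arithmetic mean."""
    ss_residual = np.sum(w * (y - y_pred) ** 2)
    weighted_mean = np.average(y, weights=w)        # sum(w * y) / sum(w)
    ss_total = np.sum(w * (y - weighted_mean) ** 2)
    return 1 - ss_residual / ss_total
```

To see how this differs from the unweighted formula, fit any model, then compare `weighted_r_squared(y, y_pred, w)` against `1 - sum((y - y_pred)**2) / sum((y - y.mean())**2)`: with non-uniform weights the two can diverge sharply, which is exactly the OLS-vs-WLS discrepancy observed in the question.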
