Solved – How does Stata calculate the RMSE in a regression with weights?

Tags: error, regression, rms, stata, weighted-data

This problem came up because I was trying to replicate in R some results I was getting in Stata, and I could replicate everything except the root mean squared error.

When I run a regression in Stata, I am able to perfectly replicate the calculated RMSE using a manual calculation.

However, when I include any kind of weight, I'm unable to replicate this.

Here's code to create some example data.

clear
set seed 12345
drawnorm x e, n(1000)
gen y = 5+(2*x)+e
gen w = x*runiform()

Here is code for running a basic regression and calculating the RMSE.

reg y x
local df = `e(df_r)'
predict yhat
gen sqerr = (yhat-y)^2
egen mse = total(sqerr)
replace mse = mse/`df'
gen rmse = sqrt(mse)
mean rmse

Here, I get an output RMSE of 1.0151, and a manually-calculated RMSE of 1.0151.

Here is the code for doing it with weights.

drop yhat sqerr mse rmse
reg y x [aweight=w]
local df = `e(df_r)'
predict yhat
gen sqerr = (yhat-y)^2
egen mse = total(sqerr)
replace mse = mse/`df'
gen rmse = sqrt(mse)
mean rmse

Here, I get an output RMSE of 0.985 and a manually-calculated RMSE of 1.442.

Any ideas? I've searched the web, and everyone seems to agree that the RMSE should be calculated as $\sqrt{\frac{\sum{(\text{predicted}-\text{actual})^2}}{\text{df}}}$. How does this formula change when weights are included?
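
For reference, here is the unweighted manual calculation sketched in Python with NumPy (a different random number generator than Stata's, so the value won't be 1.0151 exactly, but it should come out close to 1, the true error s.d.):

```python
import numpy as np

# Rough Python analogue of the unweighted manual calculation above.
rng = np.random.default_rng(12345)
n = 1000
x = rng.standard_normal(n)
e = rng.standard_normal(n)
y = 5 + 2 * x + e

# OLS of y on x with a constant.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# RMSE = sqrt(residual sum of squares / residual df), df = n - 2 here.
rmse = np.sqrt(np.sum(resid**2) / (n - 2))
print(rmse)
```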

Best Answer

I take the essence of this to be statistical rather than purely a programming problem.

As you don't use the weights in your manual calculation, the lack of agreement is at first sight not surprising, but there are other problems too, some (but not all) of them trivial.

  1. Random number generation. Although you carefully set the seed, note that Stata's random number generator changed in Stata 14 (http://www.stata.com/help.cgi?whatsnew13to14 #15). Below I am using Stata 10.1 on an old computer. This affects only which numbers are generated and is separate from getting the same answers out of regress and a hand calculation. You should be able to reproduce what I am doing with version control.

  2. Using double precision consistently. regress uses double precision, so hand calculations should follow along.

  3. Using weights that make sense. Your weights by construction have expectation zero, but Stata ignores negative weights, typically about half of them. We could adjust for that by comparing like with like, but it's cleaner to generate positive weights directly.

  4. Using the same formula. This is what interests you.
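
Point 3 can be checked quickly outside Stata; here is a NumPy sketch (mine, not part of the original exchange) showing that w = x*runiform() is negative whenever x is, i.e. for roughly half of the sample — exactly the observations that would be ignored:

```python
import numpy as np

# w = x * runiform() is negative exactly when x < 0, so with x
# standard normal about half the weights are negative; per the
# discussion above, Stata ignores those observations.
rng = np.random.default_rng(12345)
x = rng.standard_normal(1000)
w = x * rng.uniform(size=1000)
print((w < 0).mean())  # roughly 0.5
```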

The code here doesn't follow yours; in some cases I have used slightly more direct routes. Note that scalars hold more precision than locals. The number of decimal places shown is ridiculous for every purpose except the present one of reproducing a calculation.

. clear

. set seed 12345

. drawnorm x e, n(1000)
(obs 1000)

. gen y = 5+(2*x)+e

. gen double w = runiform()

. reg y x

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) = 3713.72
       Model |  3846.76283     1  3846.76283           Prob > F      =  0.0000
    Residual |  1033.75252   998  1.03582417           R-squared     =  0.7882
-------------+------------------------------           Adj R-squared =  0.7880
       Total |  4880.51536   999  4.88540076           Root MSE      =  1.0178

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.963036   .0322124    60.94   0.000     1.899824    2.026248
       _cons |   5.015532   .0322162   155.68   0.000     4.952313    5.078752
------------------------------------------------------------------------------

. di %23.18f e(rmse) 
   1.017754475398205700

. predict double err, res 

. gen double sqerr = err^2

. su sqerr, meanonly 

. di _n "rmse " %23.18f sqrt(r(sum)/e(df_r)) 

rmse    1.017754475398205700

. reg y x [aw=w] 
(sum of wgt is   4.9129e+02)

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) = 3666.47
       Model |  3913.22661     1  3913.22661           Prob > F      =  0.0000
    Residual |  1065.16666   998  1.06730126           R-squared     =  0.7860
-------------+------------------------------           Adj R-squared =  0.7858
       Total |  4978.39327   999  4.98337665           Root MSE      =  1.0331

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.984769   .0327783    60.55   0.000     1.920447    2.049092
       _cons |   5.005039   .0326721   153.19   0.000     4.940925    5.069153
------------------------------------------------------------------------------

. di %23.18f e(rmse) 
   1.033102736537657100

. predict double werr, res 

. su w, meanonly 

. scalar wsum = r(sum) 

. gen double sqwerr = (w/wsum) * e(N) * werr^2

. su sqwerr, meanonly 

. di _n "rmse " %23.18f sqrt(r(sum)/e(df_r)) 

rmse    1.033102736537657300

I regard Stata code as self-evident, much as R users presumably feel about R. What may help: r(sum), e(N), and e(df_r) are saved scalar results accessible after running commands, and res is the option that instructs predict to produce residuals rather than predicted values.
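
For anyone replicating this elsewhere, the formula implied above — rescale the aweights to sum to N, fit by weighted least squares, and divide the weighted residual sum of squares by the residual df — can be sketched in Python with NumPy (my sketch, with a different random number generator, so it reproduces the formula rather than the numbers above):

```python
import numpy as np

# Sketch of the aweight RMSE formula used above:
#   w_tilde = w / sum(w) * N        (weights rescaled to sum to N)
#   rmse    = sqrt(sum(w_tilde * resid^2) / (N - k))
rng = np.random.default_rng(12345)
n = 1000
x = rng.standard_normal(n)
y = 5 + 2 * x + rng.standard_normal(n)
w = rng.uniform(size=n)              # positive weights, as in the answer

# Weighted least squares: scale rows by sqrt(w) and run OLS.
X = np.column_stack([np.ones(n), x])
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
resid = y - X @ beta

w_tilde = w / w.sum() * n            # normalized weights, summing to n
rmse = np.sqrt(np.sum(w_tilde * resid**2) / (n - X.shape[1]))
print(rmse)  # close to 1, the true error s.d.
```

Multiplying by (w/wsum) * e(N) in the Stata code is the same normalization as w_tilde here.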

The very minor discrepancy in the last digits of the second case I trust to be trivial rounding error.

If compatibility with R or anything else continues to worry you, I'd recommend a test suite that does not depend on either program's random number generator.
