This problem came up because I was trying to replicate some results I was getting in Stata with R, and I was able to replicate everything except for the root mean squared error.
When I run a regression in Stata, I am able to perfectly replicate the calculated RMSE using a manual calculation.
However, when I include any kind of weight, I'm unable to replicate this.
Here's code to create some example data.
clear
set seed 12345
drawnorm x e, n(1000)
gen y = 5+(2*x)+e
gen w = x*runiform()
Here is code for running a basic regression and calculating the RMSE.
reg y x
local df = `e(df_r)'
predict yhat
gen sqerr = (yhat-y)^2
egen mse = total(sqerr)
replace mse = mse/`df'
gen rmse = sqrt(mse)
mean rmse
Here, I get an output RMSE of 1.0151, and a manually-calculated RMSE of 1.0151.
Here is the code for doing it with weights.
drop yhat sqerr mse rmse
reg y x [aweight=w]
local df = `e(df_r)'
predict yhat
gen sqerr = (yhat-y)^2
egen mse = total(sqerr)
replace mse = mse/`df'
gen rmse = sqrt(mse)
mean rmse
Here, I get an output RMSE of 0.985 and a manually-calculated RMSE of 1.442.
Any ideas? I've searched the web, and everyone seems to agree the RMSE should be calculated with $\sqrt{\frac{\sum{(predict-actual)^2}}{df}}$. Any ideas of how this formula is different when weights are included?
Best Answer
The essence of this I do take to be statistical, rather than a pure programming problem.
As you don't use weights in your manual calculation, the lack of agreement is at first sight not surprising, but there are other problems too, some but not all trivial.
Random number generation. Although you carefully
set seed
note also that Stata's random number generator changed in Stata 14 (http://www.stata.com/help.cgi?whatsnew13to14 #15). Below I am using Stata 10.1 on an old computer. This just affects which numbers are generated and should be separate from getting the same answers out ofregress
and a hand-calculation. You should able to reproduce what I am doing with version control.Using
double
precision consistently.regress
uses double precision, so hand calculations should follow along.Using weights that make sense. Your weights by construction have expectation zero, but Stata ignores negative weights, typically about half of them. We could adjust for that by comparing like with like, but it's cleaner to generate positive weights directly.
Using the same formula. This is what interests you.
The code here doesn't follow yours. In some cases I have used slightly more direct routes. Note that
scalar
s hold more precision thanlocal
s. The number of decimal places here is ridiculous for every purpose except the present purpose of reproducing a calculation.I regard Stata code as being as self-evident as presumably R users feel also about R. What may help is that
r(sum)
,e(N)
ande(df_r)
are saved scalar results accessible after running commands.res
is the option that instructspredict
to produce plain residuals, not predicted values.The very minor discrepancy in the second case I trust to be trivial.
If compatibility with R or anything else continues to worry, I'd recommend a test suite not dependent on either's random number generator.