I might be telling you something you already know, but keep in mind that really
$\hat{f}(x)=\hat{f}(x,\{X_k\})$,
where $\{X_k\}$ is the set of sample points over which you build your estimate. For most non-parametric estimators, the $X_k$ are assumed independent, and the method is additive, so you can just look at the $MSE$ of $\hat{f}(x,X_k)$ and then take an average.
Then your formula is interpreted as
$MSE(\hat f(x)) = E[(\hat f(x)-f(x))^{2}]=\int_\Omega(\hat{f}(x,z)-f(x))^2f(z)dz$,
which, yes, is the MSE at the point $x$.
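For intuition, note that this pointwise MSE obeys the standard bias–variance decomposition (a well-known identity, stated here for reference):
$$\operatorname{MSE}(\hat f(x)) = \operatorname{Var}(\hat f(x)) + \left(\operatorname{E}[\hat f(x)] - f(x)\right)^2.$$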
As for the $MSEP$, I'm not entirely sure what your question is, but there are surely various ways to predict this. If you want to know the error expected at a fixed $x^*$, then I guess it probably does line up with the $MSE$. However, you might instead want, for example, the prediction error for a random $X^*$; if you assume $X^*$ is drawn from the density $f$, then you would want the $MISE$.
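To make that last point concrete: if $X^* \sim f$ is drawn independently of the sample, then conditioning on $X^*$ shows that the expected squared error is the $f$-weighted integral of the pointwise MSE (a standard identity; the usual $MISE$ differs only in integrating against $dx$ rather than $f(x)\,dx$):
$$\operatorname{E}\big[(\hat f(X^*) - f(X^*))^2\big] = \int_\Omega \operatorname{MSE}(\hat f(x))\, f(x)\,dx.$$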
Hope that helps clarify something.
The essence of this I do take to be statistical, rather than a pure programming problem.
As you don't use weights in your manual calculation, the lack of agreement is at first sight not surprising, but there are other problems too, some but not all trivial:

1. Random number generation. Although you carefully `set seed`, note also that Stata's random number generator changed in Stata 14 (http://www.stata.com/help.cgi?whatsnew13to14, #15). Below I am using Stata 10.1 on an old computer. This just affects which numbers are generated and should be separate from getting the same answers out of `regress` and a hand calculation. You should be able to reproduce what I am doing with version control.

2. Using `double` precision consistently. `regress` uses double precision, so hand calculations should follow along.

3. Using weights that make sense. Your weights by construction have expectation zero, but Stata ignores negative weights, typically about half of them. We could adjust for that by comparing like with like, but it's cleaner to generate positive weights directly.

4. Using the same formula. This is what interests you.

The code here doesn't follow yours; in some cases I have used slightly more direct routes. Note that `scalar`s hold more precision than `local`s. The number of decimal places shown is ridiculous for every purpose except the present one of reproducing a calculation.
. clear
. set seed 12345
. drawnorm x e, n(1000)
(obs 1000)
. gen y = 5+(2*x)+e
. gen double w = runiform()
. reg y x
Source | SS df MS Number of obs = 1000
-------------+------------------------------ F( 1, 998) = 3713.72
Model | 3846.76283 1 3846.76283 Prob > F = 0.0000
Residual | 1033.75252 998 1.03582417 R-squared = 0.7882
-------------+------------------------------ Adj R-squared = 0.7880
Total | 4880.51536 999 4.88540076 Root MSE = 1.0178
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.963036 .0322124 60.94 0.000 1.899824 2.026248
_cons | 5.015532 .0322162 155.68 0.000 4.952313 5.078752
------------------------------------------------------------------------------
. di %23.18f e(rmse)
1.017754475398205700
. predict double err, res
. gen double sqerr = err^2
. su sqerr, meanonly
. di _n "rmse " %23.18f sqrt(r(sum)/e(df_r))
rmse 1.017754475398205700
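In other words, the hand calculation reproduces `e(rmse)` via the usual residual standard error formula, here with $n = 1000$ observations and $n - 2 = 998$ residual degrees of freedom:
$$\text{Root MSE} = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - 2}}.$$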
. reg y x [aw=w]
(sum of wgt is 4.9129e+02)
Source | SS df MS Number of obs = 1000
-------------+------------------------------ F( 1, 998) = 3666.47
Model | 3913.22661 1 3913.22661 Prob > F = 0.0000
Residual | 1065.16666 998 1.06730126 R-squared = 0.7860
-------------+------------------------------ Adj R-squared = 0.7858
Total | 4978.39327 999 4.98337665 Root MSE = 1.0331
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.984769 .0327783 60.55 0.000 1.920447 2.049092
_cons | 5.005039 .0326721 153.19 0.000 4.940925 5.069153
------------------------------------------------------------------------------
. di %23.18f e(rmse)
1.033102736537657100
. predict double werr, res
. su w, meanonly
. scalar wsum = r(sum)
. gen double sqwerr = (w/wsum) * e(N) * werr^2
. su sqwerr, meanonly
. di _n "rmse " %23.18f sqrt(r(sum)/e(df_r))
rmse 1.033102736537657300
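The weighted version is the same formula after rescaling the weights to sum to the number of observations, which is exactly what the `sqwerr` line above does:
$$\text{Root MSE}_w = \sqrt{\frac{\sum_i \tilde w_i\, e_i^2}{n - 2}}, \qquad \tilde w_i = \frac{n\, w_i}{\sum_j w_j}.$$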
I regard Stata code as self-evident, much as R users presumably feel about R. What may help is that `r(sum)`, `e(N)` and `e(df_r)` are saved scalar results accessible after running commands. `res` is the option that instructs `predict` to produce plain residuals, not predicted values.
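If you want to see everything available, `return list` and `ereturn list` display the saved r-class and e-class results after the relevant commands (a minimal illustration, using the variables from above):

. quietly summarize w
. return list
. quietly regress y x
. ereturn list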
The very minor discrepancy in the final digits in the second case I trust to be trivial.
If compatibility with R or anything else continues to be a worry, I'd recommend a test suite that does not depend on either program's random number generator.
Best Answer
If you look at the Stata output: dividing the residual sum of squares (851.469) by its degrees of freedom (72) yields 11.826. That is the residual mean square (the MS column). Its square root, $\sqrt{11.826} \approx 3.4389$, is the Root MSE reported in the output.
Basically, it's a measure of accuracy: a more accurate model has smaller errors, hence a smaller residual sum of squares, a smaller MS, and a smaller Root MSE. However, this comparison is only valid across models with the same dependent variable, because MS and Root MSE are not standardized; depending on the units of measurement, Root MSE can vary greatly.
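As a quick check after any `regress`, the saved results reproduce this arithmetic directly (a minimal sketch; substitute your own outcome and predictors for `y` and `x`):

. quietly regress y x
. display e(rss)/e(df_r)
. display sqrt(e(rss)/e(df_r))
. display e(rmse)

Here `e(rss)` is the residual sum of squares, `e(df_r)` its degrees of freedom, and `e(rmse)` the Root MSE as `regress` saves it; the last two `display`s should agree.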