Regression – Differences in Output Between SPSS and Stata

regressionspssstata

I'm Stata-proficient but learning SPSS for my new position. I am using a simple dataset to do very basic regressions and comparing to see if the results are the same. They're not. I'm close, but the magnitudes of the betas and significance are slightly different. The data was copy and pasted into each from an Excel; I didn't use a Stata file in SPSS, or vice versa. For SPSS, I did not weight it, it's using listwise deletion, and it's on the "enter" method. I presume Stata is doing the same, as its default (but correct me if I'm wrong and it's a different default!).

Any ideas on what else to check? I'm doing just a simple linear regression.

Data: https://www.dropbox.com/s/8g31cjf8vr69i44/rwj%20county%20data.xls?dl=0

Syntax

SPSS

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF
/DEPENDENT FreeLunch
/METHOD=ENTER FoodInsecure Rural Female @18 Hispanic.

(Or, for point and click, analyze->regression->linear; forces a choice under "method" for enter/stepwise/remove/backward/forward.)

Stata

reg percentfreelunch percentfoodinsecure rural female under18 hispanic

Data is in Excel and was pasted into both.

Results

SPSS

           Var. |   Unst.B | Std.Err. |   St.B |     t | Sig.
     (Constant) | -139.616 |   66.652 | -2.095 | .045
% Food Insecure |    2.785 |     .674 |   .546 | 4.131 | .000
          Rural |     .131 |     .048 |   .404 | 2.701 | .011
         Female |    2.657 |    1.170 |   .372 | 2.270 | .031
           < 18 |    -.416 |     .583 |  -.145 | -.715 | .480
       Hispanic |    1.156 |     .236 |  1.092 | 4.905 | .000

Stata

               Var. |     Coef. | Std.Err. |     t | P>|t|
percentfoodinsecure |   2.76532 | .6741544 |  4.10 | 0.000
              rural |  .1378976 | .0495354 |  2.78 | 0.009
             female |  2.826711 | 1.204272 |  2.35 | 0.026
            under18 | -.3799895 |  .588423 | -0.65 | 0.523
           hispanic |  1.168375 | .2398765 |  4.87 | 0.000
              _cons | -149.3858 |  69.0891 | -2.16 | 0.039

Best Answer

The problem (amazingly) has to do with rounding the values during pasting.

In Excel, most of the values were computed elsewhere and are recorded as doubles (about 16 decimal places of precision). Only % Food Insecure actually is stored to a small number of decimal places (one). None of the data columns is stored as it appears in Excel. During pasting, the receiving application typically will accept the data as they appear, not as they are actually stored!

Rounding of data matters in this situation because for several variables--especially percent female and percent food insecurity--the amounts rounded off can be an appreciable fraction of the standard deviation of the data.

When I read the Excel data directly in R using xlsx::read.xlsx, I reproduce the SPSS results exactly. When I round the data to integers (for % Free Lunch) and to one decimal place for the others--as they appear when pasting them into R--I get new results, but the estimated coefficients change appreciably. For instance, the intercept of $-139.616$ becomes $-133.897$.

I have not been able to reproduce the Stata results in R (my summary statistics do not quite agree with those presented by Nick Cox: my mean for % Food insecure is $15.67$ instead of $15.81$), but I suspect that if I were to paste them into my copy of Stata, I would get the reported Stata results. (A big clue is the rounded values presented for the minima and maxima: in most cases these are not the minima and maxima actually recorded in the Excel file.)

The differences between the two sets of results are a small fraction of a standard error, so they are--in this statistical sense--of no consequence.

There is no collinearity problem: the VIFs are nice and low.

Moral

When you care about your data, read them directly: do not intervene manually via copy-and-paste or transcription.

Related Question