I'm Stata-proficient but learning SPSS for my new position. I am using a simple dataset to do very basic regressions and comparing to see if the results are the same. They're not. I'm close, but the magnitudes of the betas and significance are slightly different. The data was copy and pasted into each from an Excel; I didn't use a Stata file in SPSS, or vice versa. For SPSS, I did not weight it, it's using listwise deletion, and it's on the "enter" method. I presume Stata is doing the same, as its default (but correct me if I'm wrong and it's a different default!).
Any ideas on what else to check? I'm doing just a simple linear regression.
Data: https://www.dropbox.com/s/8g31cjf8vr69i44/rwj%20county%20data.xls?dl=0
Syntax
SPSS
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF
/DEPENDENT FreeLunch
/METHOD=ENTER FoodInsecure Rural Female @18 Hispanic.
(Or, for point and click, analyze->regression->linear; forces a choice under "method" for enter/stepwise/remove/backward/forward.)
Stata
reg percentfreelunch percentfoodinsecure rural female under18 hispanic
Data is in Excel and was pasted into both.
Results
SPSS
Var. | Unst.B | Std.Err. | St.B | t | Sig.
(Constant) | -139.616 | 66.652 | -2.095 | .045
% Food Insecure | 2.785 | .674 | .546 | 4.131 | .000
Rural | .131 | .048 | .404 | 2.701 | .011
Female | 2.657 | 1.170 | .372 | 2.270 | .031
< 18 | -.416 | .583 | -.145 | -.715 | .480
Hispanic | 1.156 | .236 | 1.092 | 4.905 | .000
Stata
Var. | Coef. | Std.Err. | t | P>|t|
percentfoodinsecure | 2.76532 | .6741544 | 4.10 | 0.000
rural | .1378976 | .0495354 | 2.78 | 0.009
female | 2.826711 | 1.204272 | 2.35 | 0.026
under18 | -.3799895 | .588423 | -0.65 | 0.523
hispanic | 1.168375 | .2398765 | 4.87 | 0.000
_cons | -149.3858 | 69.0891 | -2.16 | 0.039
Best Answer
The problem (amazingly) has to do with rounding the values during pasting.
In Excel, most of the values were computed elsewhere and are recorded as doubles (about 16 decimal places of precision). Only
% Food Insecure
actually is stored to a small number of decimal places (one). None of the data columns is stored as it appears in Excel. During pasting, the receiving application typically will accept the data as they appear, not as they are actually stored!Rounding of data matters in this situation because for several variables--especially percent female and percent food insecurity--the amounts rounded off can be an appreciable fraction of the standard deviation of the data.
When I read the Excel data directly in
R
usingxlsx::read.xlsx
, I reproduce the SPSS results exactly. When I round the data to integers (for% Free Lunch
) and to one decimal place for the others--as they appear when pasting them intoR
--I get new results, but the estimated coefficients change appreciably. For instance, the intercept of $-139.616$ becomes $-133.897$.I have not been able to reproduce the Stata results in
R
(my summary statistics do not quite agree with those presented by Nick Cox: my mean for% Food insecure
is $15.67$ instead of $15.81$), but I suspect that if I were to paste them into my copy of Stata, I would get the reported Stata results. (A big clue is the rounded values presented for the minima and maxima: in most cases these are not the minima and maxima actually recorded in the Excel file.)The differences between the two sets of results are a small fraction of a standard error, so they are--in this statistical sense--of no consequence.
There is no collinearity problem: the VIFs are nice and low.
Moral
When you care about your data, read them directly: do not intervene manually via copy-and-paste or transcription.