Solved – Standard errors of estimates vary in PROC REG and PROC GENMOD!

generalized linear model, r, regression, sas, standard error

I am trying to match the outputs of PROC REG with PROC GENMOD. I ran a sample test on the 'iris' dataset of R.

The data set is as follows (150 rows in total):

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

My PROC REG code is:

proc reg data=iris;
    model Sepal_Length = Sepal_Width Petal_Length Petal_Width;
run;

My PROC GENMOD code is:

proc genmod data=iris;
    model Sepal_Length = Sepal_Width Petal_Length Petal_Width / dist=normal;
run;

The output for PROC REG is:

                        Parameter       Standard
Variable        DF       Estimate          Error    t Value    Pr > |t|

Intercept        1        1.85600        0.25078       7.40      <.0001
Sepal_Width      1        0.65084        0.06665       9.77      <.0001
Petal_Length     1        0.70913        0.05672      12.50      <.0001
Petal_Width      1       -0.55648        0.12755      -4.36      <.0001

The output for PROC GENMOD is:

                               Standard     Wald 95% Confidence      Chi-
Parameter      DF   Estimate      Error           Limits           Square   Pr > ChiSq

Intercept       1     1.8560     0.2474     1.3711       2.3409     56.28       <.0001
Sepal_Width     1     0.6508     0.0658     0.5220       0.7797     97.98       <.0001
Petal_Length    1     0.7091     0.0560     0.5995       0.8188    160.59       <.0001
Petal_Width     1    -0.5565     0.1258    -0.8031      -0.3098     19.56       <.0001
Scale           1     0.3103     0.0179     0.2771       0.3475

As I understand it, the standard errors from the two procedures should match, because the generalized linear model is fit with a normal distribution.

I also ran both regressions in R using lm() and glm(..., family=gaussian), and the standard errors came out equal. Moreover, they are the same as the standard errors from PROC REG.

Can anyone elaborate on why they are not matching?

Best Answer

Whenever you see small inconsistent discrepancies among standard errors or other quantities directly related to variances, suspect bias corrections.

In this case, we are given ample clues. First, consider the ratios of the standard errors:

$$\frac{(0.25078,\ 0.06665,\ 0.05672,\ 0.12755)}{(0.2474,\ 0.0658,\ 0.0560,\ 0.1258)} = (1.0137,\ 1.0129,\ 1.0129,\ 1.0139) \quad \text{(elementwise)}.$$

Next, consider that the regression itself involves $150$ observations and $4$ estimated coefficients (an intercept and three slopes). A bias-corrected estimate of a variance would therefore involve a ratio of $150$ to $150-4$. Let's see what this correction might do to the squares of the standard errors. Squaring the previous ratios and multiplying by $146/150$ gives

$$(1.0001105, 0.9986427, 0.9985228, 1.0006017)$$

Their mean is $0.9995 \pm .0005$, which is essentially $1$. Provisionally, then, it is fair to conclude that both procedures are doing the same thing but using different estimates for the variances of the parameters. Most likely one of them is using Maximum Likelihood estimates (which involve no bias corrections) and the other is using ordinary least squares formulas (which usually include bias corrections). Given that GENMOD is generalized linear model code, and GLMs are (almost) always fit using ML, and given that REG is least squares regression, this conclusion seems well supported.
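The arithmetic above can be reproduced in a few lines. The sketch below is written in Python with only the standard library (not in SAS or R) so that it is self-contained; the variable names are illustrative, and the numbers are copied from the two output listings. It assumes the usual pair of variance estimates: maximum likelihood divides the residual sum of squares by $n$, while the least squares formula divides by $n-p$, so the standard errors should differ by a factor of $\sqrt{n/(n-p)}$.

```python
import math

n_obs, n_params = 150, 4  # observations; intercept plus three slopes

# Standard errors as printed in the two listings
# (order: Intercept, Sepal_Width, Petal_Length, Petal_Width).
se_reg = [0.25078, 0.06665, 0.05672, 0.12755]   # PROC REG
se_genmod = [0.2474, 0.0658, 0.0560, 0.1258]    # PROC GENMOD

# Ratio of standard errors, coefficient by coefficient.
ratios = [r / g for r, g in zip(se_reg, se_genmod)]

# Squared ratios compare variances; multiplying by (n - p) / n
# undoes the presumed bias correction.
adjusted = [q * q * (n_obs - n_params) / n_obs for q in ratios]
mean_adjusted = sum(adjusted) / len(adjusted)

# Equivalent check: rescale the GENMOD standard errors by sqrt(n / (n - p))
# and compare them with the PROC REG values.
factor = math.sqrt(n_obs / (n_obs - n_params))
rescaled = [g * factor for g in se_genmod]

print("ratios:        ", [round(q, 4) for q in ratios])
print("adjusted:      ", [round(a, 4) for a in adjusted])
print("mean adjusted: ", round(mean_adjusted, 4))
print("rescale factor:", round(factor, 4))
print("rescaled SEs:  ", [round(s, 5) for s in rescaled])
```

If the bias-correction explanation is right, the rescaled GENMOD standard errors should agree with the PROC REG column to about the precision at which GENMOD prints them.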

We should still be a little puzzled by the variation in residual ratios, even though it's small: these ratios differ from $1$ by about $0.1$%. Without anything to go on but experience, I would provisionally attribute these variations to numerical errors associated with incomplete convergence in the likelihood optimization procedure of GENMOD. We're getting close to the default value of the parameter convergence tolerance (XCONV) of $10^{-4}$.
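One further thing worth checking before blaming convergence is how much of this variation display rounding alone could produce, since GENMOD prints its standard errors to four decimal places and REG to five. The following Python sketch (standard library only; the helper `adjusted_range` is hypothetical, introduced just for this check) computes, for each coefficient, the range of adjusted squared ratios that is consistent with the printed digits. It is a plausibility check only and says nothing about what GENMOD does internally.

```python
n_obs, n_params = 150, 4  # observations; intercept plus three slopes

# Standard errors as printed in the two listings.
se_reg = {"Intercept": 0.25078, "Sepal_Width": 0.06665,
          "Petal_Length": 0.05672, "Petal_Width": 0.12755}
se_genmod = {"Intercept": 0.2474, "Sepal_Width": 0.0658,
             "Petal_Length": 0.0560, "Petal_Width": 0.1258}


def adjusted_range(reg, gen, reg_digits=5, gen_digits=4):
    """Range of (reg / gen)^2 * (n - p) / n allowed by display rounding.

    Each printed value is assumed to lie within half a unit of its
    last printed decimal place.
    """
    reg_half = 0.5 * 10 ** -reg_digits
    gen_half = 0.5 * 10 ** -gen_digits
    scale = (n_obs - n_params) / n_obs
    low = ((reg - reg_half) / (gen + gen_half)) ** 2 * scale
    high = ((reg + reg_half) / (gen - gen_half)) ** 2 * scale
    return low, high


ranges = {name: adjusted_range(se_reg[name], se_genmod[name])
          for name in se_reg}

for name, (low, high) in ranges.items():
    print(f"{name:13s} {low:.4f} .. {high:.4f}")
```

If every printed range contains $1$, rounding of the displayed values would by itself be enough to account for deviations of this size.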

Related Solutions

Logistic – Differences in Output Between SAS’s PROC GENMOD and R’s GLM

I notice several things here.

First, when you enter your data via matrix, all the data have to be the same type. Thus, they are coerced to be the most inclusive type, strings, which in turn are coerced to be factors by default. Note:

testdata <- data.frame(matrix(c("f","Test", 1.75,   16, 0,  16, 0,  1,  1,
...
sapply(testdata, class)
#      sex  vaccine     dose    not_p     para        n      pct   vacnum    sexno
# "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"

Try using read.table(text='...', sep=",") instead:

testdata <- read.table(text='"f", "Test", 1.75,   16,   0,  16,  0,      1,  1
"m", "Test", 1.75,   15,   1,  16,  6.25,   1,  0
"f", "Test", 2.75,    4,  12,  16, 75,      1,  1
"m", "Test", 2.75,    9,   6,  15, 40,      1,  0
"f", "WHO",  1.75,   15,   1,  16,  6.25,   0,  1
"m", "WHO",  1.75,   14,   2,  16, 12.5,    0,  0
"f", "WHO",  2.75,    2,  13,  15, 86.6667, 0,  1
"m", "WHO",  2.75,    3,  13,  16, 81.25,   0,  0', sep=",")
names(testdata) <- c("sex", "vaccine", "dose", "not_p", "para", "n", "pct", 
                     "vacnum", "sexno")
sapply(testdata, class)
#      sex   vaccine      dose     not_p      para         n       pct    vacnum 
# "factor"  "factor" "numeric" "integer" "integer" "integer" "numeric" "integer" 
#     sexno 
# "integer"

(That was small potatoes.) The next trap to worry about is that SAS and R code logistic regression for binomial data differently. SAS uses "events over trials", but R uses the odds, successes/failures. Thus, your model formula should be:

form <- as.formula("cbind(para, n-para) ~ dose + sex + vacnum")
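To make the two response codings concrete, here is a small sketch in Python (standard library only, chosen so it is self-contained; the variable names are illustrative). It copies the counts from the read.table() call above and converts the "events over trials" pairs into (successes, failures) pairs, which is what cbind(para, n-para) builds in R.

```python
# (sex, vaccine, dose, not_p, para, n) copied from the read.table() call above.
rows = [
    ("f", "Test", 1.75, 16, 0, 16),
    ("m", "Test", 1.75, 15, 1, 16),
    ("f", "Test", 2.75, 4, 12, 16),
    ("m", "Test", 2.75, 9, 6, 15),
    ("f", "WHO", 1.75, 15, 1, 16),
    ("m", "WHO", 1.75, 14, 2, 16),
    ("f", "WHO", 2.75, 2, 13, 15),
    ("m", "WHO", 2.75, 3, 13, 16),
]

# SAS-style response: (events, trials).
events_trials = [(para, n) for _, _, _, _, para, n in rows]

# R-style response: (successes, failures), as in cbind(para, n - para).
successes_failures = [(events, trials - events)
                      for events, trials in events_trials]

# In this data set the failures are already stored in the not_p column.
not_p = [row[3] for row in rows]

for (events, trials), (succ, fail) in zip(events_trials, successes_failures):
    print(f"{events:2d}/{trials:2d} events/trials  ->  "
          f"({succ:2d}, {fail:2d}) successes/failures")
```

Both codings carry the same information; the point is only that the second column means "trials" in one and "failures" in the other.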

Finally, you specified family=quasibinomial (i.e., the quasibinomial) in your R code, but DIST=BIN (i.e., the binomial) in your SAS code. To match the SAS output, use the binomial instead. Thus, your final model is:

fitreduced <- glm(form, family=binomial(link="logit"), data=testdata)
coef(summary(fitreduced))
#               Estimate Std. Error   z value     Pr(>|z|)
# (Intercept) -9.4020028  1.6219570 -5.796703 6.763131e-09
# dose         3.9207805  0.6460193  6.069138 1.285986e-09
# sexf         0.5574087  0.5184112  1.075225 2.822741e-01
# vacnum      -1.3221011  0.5482645 -2.411430 1.589012e-02

This seems to match the SAS estimates and standard errors.

Solved – Why does glm() provide estimates and standard errors on the link scale

Hard to know for sure, but there are a few reasons the link scale is useful.

Using standard errors as a summary of uncertainty is generally more reliable on the link scale, where the domain of the parameters is unbounded and where the assumption that the likelihood surface is approximately quadratic ($\leftrightarrow$ sampling distribution of the parameter estimates is approximately Normal) is more likely to be reasonable. For example, suppose you have a log-link model with estimate (on the link scale) 1.0 and standard error 3.0. On the link scale, the confidence interval is approximately $1 \pm 1.96 \times 3$. If you back-transform, exponentiating the parameter and multiplying the standard error by the exponentiated parameter (the delta method), and then try to construct symmetric CIs, you get $2.718 \pm 1.96 \times 3 \times 2.718$, which includes negative values. If you do want to back-transform, it makes more sense to back-transform the confidence intervals, i.e. $\exp(1 \pm 1.96 \times 3)$.

Probably more importantly, for the very common logit link, it's basically impossible to sensibly back-transform the parameters all the way to the data scale (i.e., from logit/log-odds-ratios to probability). It is common to exponentiate parameters to move from the log-odds-ratio to the odds-ratio scale, but you can't go back from odds ratios to probabilities without specifying a baseline value. That is, you can say in general "the odds ratio associated with control vs. treatment is XXX", but the change in probability from control to treatment will depend on other covariates (e.g., the odds ratio for females and males may be the same while the change in probability is different because the baseline risk is different for females and males).
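Both points can be checked numerically. The sketch below is in Python with only the standard library; the odds ratio of 2 and the baseline probabilities 0.1 and 0.5 are hypothetical values chosen for illustration, and `apply_odds_ratio` is a helper invented for this example. The first part uses the estimate 1.0 and standard error 3.0 from the log-link example; the second applies one odds ratio to two baseline risks.

```python
import math

# Part 1: log-link example with estimate 1.0 and standard error 3.0.
estimate, std_error, z = 1.0, 3.0, 1.96

# Wald interval on the link scale.
link_ci = (estimate - z * std_error, estimate + z * std_error)

# Back-transform the estimate and its standard error, then build a
# symmetric interval on the response scale.
response_estimate = math.exp(estimate)
response_se = std_error * response_estimate
symmetric_ci = (response_estimate - z * response_se,
                response_estimate + z * response_se)

# Back-transform the endpoints of the link-scale interval instead.
backtransformed_ci = tuple(math.exp(bound) for bound in link_ci)

print("link-scale CI:      ", link_ci)
print("symmetric CI:       ", symmetric_ci)        # lower bound is negative
print("back-transformed CI:", backtransformed_ci)  # both bounds positive


# Part 2: the same odds ratio gives different changes in probability
# depending on the baseline risk.
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability after multiplying the baseline odds by odds_ratio."""
    odds = baseline_prob / (1.0 - baseline_prob) * odds_ratio
    return odds / (1.0 + odds)


odds_ratio = 2.0  # hypothetical value
changes = {p0: apply_odds_ratio(p0, odds_ratio) - p0 for p0 in (0.1, 0.5)}

for p0, change in changes.items():
    print(f"baseline {p0:.1f}: change in probability {change:+.4f}")
```

The symmetric interval extends below zero even though the response of a log-link model is positive, while the back-transformed interval stays positive but is strongly asymmetric.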

Probably the proximal reason is that because of the issues listed above, most people who do a lot of statistical modeling have gotten used to interpreting parameters on the link scale; most epidemiologists and biostatisticians have to spend time learning about odds ratios and log-odds ratios, and there are lots of papers written about their interpretation. For better or worse, R was written by people who are comfortable interpreting parameters on the link scale. Many downstream packages such as broom have options that will exponentiate parameters and CIs for you (putting them on the data (count) scale for the log link; the odds-ratio scale for logit links; and the hazard-ratio scale for cloglog links).
