R – How to Fit a GLM with Sum to Zero Constraints Without Reference Level

generalized linear modelpoisson distributionr

Question has been rewritten

I am trying to fit a glm to find out how the rate of events happens (counts/exposure) related to some covariates, with Poisson error.
Counts is the number of events happened and the covariates include age, sex, smoking status, policy class and exposure, where policy class has 6 levels: class 0, class 1, …, class 5

I was wondering how to:

  1. keeping the intercept (if possible? if not please would you explain me why not). [Answered by Affine]

  2. making the coefficients of polclass0, …, polclass5 sum to 0, I know somehow I need to use contr.sum, but it doesn't work (see below) [An example from other site, but I couldn't replicated it.

  3. I used the +0 in the formula to make the polclass0 "take over" the intercept. However, is it possible to have other categorial variable (e.g. sex, or some categorial variable that takes 3+ values) to have two all coefficients display?


Example:

    set.seed(123)
    sex <- as.factor(sample(c(0,1), 50, replace=T, prob=c(0.5, 0.5)))
    smoker <- as.factor(sample(c(0,1), 50, replace=T, prob=c(0.3, 0.7)))
    polclass <- as.factor(sample(0:5, 50, replace=T))
    age <- sample(16:80, 50, replace=T)
    count <- rpois(50, 2.5)
    exposure <- rgamma(50, 100)

    glm1 <- glm(count ~ polclass + sex + smoker + age + offset(log(exposure)) + 0 ,
                family=poisson(link='log'), contrasts = list(polclass = contr.sum))

    summary(glm1)


        Call:
        glm(formula = count ~ polclass + sex + smoker + age + offset(log(exposure)) + 
            0, family = poisson(link = "log"), contrasts = list(polclass = contr.sum))

        Deviance Residuals: 
            Min       1Q   Median       3Q      Max  
        -2.3604  -0.5625  -0.1263   0.5246   1.5584  

        Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
        polclass0 -3.719037   0.423105  -8.790   <2e-16 ***
        polclass1 -3.347628   0.328248 -10.198   <2e-16 ***
        polclass2 -3.638992   0.368175  -9.884   <2e-16 ***
        polclass3 -3.833882   0.426819  -8.982   <2e-16 ***
        polclass4 -3.780837   0.399512  -9.464   <2e-16 ***
        polclass5 -3.838007   0.373120 -10.286   <2e-16 ***
        sex1      -0.057738   0.091936  -0.628    0.530    
        smoker1   -0.174328   0.118654  -1.469    0.142    
        age       -0.001846   0.006362  -0.290    0.772    
        ---
        Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        (Dispersion parameter for poisson family taken to be 1)

            Null deviance: 8764.667  on 50  degrees of freedom
        Residual deviance:   40.865  on 41  degrees of freedom
        AIC: 189

        Number of Fisher Scoring iterations: 5

as you can see, the contr.sum doesn't work and I would like to know why and how to do that exactly?

Thanks in advance!

Best Answer

In short: The intercept takes on the reference level. Each Polclass in the output is the incremental gain over the reference level.

If you want to explore the differences between Polclass, you would use polclass <- relevel(polclass, "1") and then re-run GLM. You'll do this for each reference level.

                Ref0     Ref1     Ref2     Ref3     Ref4     Ref5
(Intercept) -3.95110 -3.57969 -3.87106 -4.06595 -4.01290 -4.07007
sex1         0.11548  0.11548  0.11548  0.11548  0.11548  0.11548
smoker1      0.34866  0.34866  0.34866  0.34866  0.34866  0.34866

polclass0         NA -0.37141 -0.08004  0.11485  0.06180  0.11897
polclass1    0.37141       NA  0.29136  0.48625  0.43321  0.49038
polclass2    0.08004 -0.29136       NA  0.19489  0.14185  0.19902
polclass3   -0.11485 -0.48625 -0.19489       NA -0.05304  0.00413
polclass4   -0.06180 -0.43321 -0.14185  0.05304       NA  0.05717
polclass5   -0.11897 -0.49038 -0.19902 -0.00413 -0.05717       NA

age         -0.00185 -0.00185 -0.00185 -0.00185 -0.00185 -0.00185

I re-ran the model with the different reference levels and then compiled the coefficients in the table above. All other coefficients (except the intercept) stay the same but each polclass changes at the different reference levels. This happens since the intercept takes into account the reference level average.

If you want to examine the impact of the different polclasses, you'd have to look at each model and make some sort of inference based on your understanding of the data.

Related Question