GLM without intercept

generalized linear modelinterceptnegative-binomial-distributionregressiontime series

I`m using a GLM to determine the influence of climate variables in the incidence of a disease in 6 different cities across time (2007-2020).

I'm using a negative binomial regression, since the dependent variable is a count.

Dependent variable: monthly number of cases (casos)

Offset: monthly population in each city (populacao)

Year (ano) is a independent continuous variable (0, 1, 2 …)

Each city (d1, d2 …) is a independent dummy variable one-hot coded.

dataset

I want to determine the temporal trend of a disease in each city without adjust for climate variables and after adjusting with climate variables.

I'm removing the intercept because I don't want that one city becomes the reference level and I need coefficients for each city.

I have 2 questions:

  1. Why I'm getting NAs in the last coefficient interaction? Are there any errors in my model?

  2. The year (ano) exp(coef) is 1,07. Is the interpretation that there is a upward trend in the incidence of disease across cities correct?

summary(m1 <- glm.nb(casos ~ 0 + ano + ano*(d1 + d2 + d3 + d4 + 
              d5 + d6) + offset(log(populacao)), data = projeto10))

Call:
glm.nb(formula = casos ~ 0 + ano + ano * (d1 + d2 + d3 + d4 + 
    d5 + d6) + offset(log(populacao)), data = projeto10, 
    init.theta = 1.202601993, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6965  -1.0652  -0.7363   0.3766   4.0045  

Coefficients: (1 not defined because of singularities)
        Estimate Std. Error  z value Pr(>|z|)    
ano      0.06878    0.03153    2.181  0.02916 *  
d1     -10.22251    0.08561 -119.414  < 2e-16 ***
d2     -10.41664    0.08864 -117.510  < 2e-16 ***
d3     -11.44485    0.10843 -105.548  < 2e-16 ***
d4     -11.59067    0.12339  -93.938  < 2e-16 ***
d5     -11.60676    0.12727  -91.198  < 2e-16 ***
d6     -11.77710    0.12766  -92.250  < 2e-16 ***
ano:d1  -0.09080    0.03802   -2.389  0.01692 *  
ano:d2  -0.10842    0.03845   -2.820  0.00480 ** 
ano:d3  -0.06582    0.04144   -1.588  0.11223    
ano:d4  -0.14021    0.04388   -3.196  0.00139 ** 
ano:d5   0.01989    0.04447    0.447  0.65471    
ano:d6        NA         NA       NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(1.2026) family taken to be 1)

    Null deviance: 24450.50  on 1008  degrees of freedom
Residual deviance:   989.11  on  996  degrees of freedom
AIC: 2863

Number of Fisher Scoring iterations: 1


              Theta:  1.203 
          Std. Err.:  0.134 

 2 x log-likelihood:  -2837.036 
> exp(coef(m1))
         ano           d1           d2           d3           d4           d5           d6       ano:d1       ano:d2       ano:d3 
1.071203e+00 3.634290e-05 2.993030e-05 1.070447e-05 9.252024e-06 9.104293e-06 7.678393e-06 9.131966e-01 8.972502e-01 9.362954e-01 
      ano:d4       ano:d5       ano:d6 
8.691727e-01 1.020087e+00           NA 

Best Answer

You are presumably trying to fit a slope and a trend for each city. You have six cities, so that's six slopes and six trends. But you have attempted to fit a linear model with 13 coefficients, so obviously one of the coefficients must be redundant.

The redundancy is not caused by removing the intercept but rather by trying to force R to estimate 6 interactions terms with your six one-hot coded dummy variables when there are actually only 5 degrees of freedom available for interaction.

Your problem would be easier if you created a factor indicating the city (lets call it City) instead of doing your own coding of dummy variables. The model you presumably want is:

m1 <- glm.nb(casos ~ 0 + City + City:ano + offset(log(populacao)))

which will estimate a separate intercept and slope for each city. An added advantage of this model is that the coefficients for the different cities are statistically independent, so forming confidence intervals for the slopes or for pairwise comparisons between them is easy.

Note the use of : instead of *, which tells R to include only the simple nested interaction without trying to expand out a main effect for ano, which is what has caused all your problems.

I can already tell from the model summary posted in your question what the trend will be for each city. The trend will be positive (upward) for cities 5 and 6, there's no trend for city 3, and the trend is negative (downward) for cities 1, 2 and 4. The slope for city 6 is 0.06878 (P = 0.029). The slope for city 5 is 0.06878 + 0.01989 = 0.08867, which will presumably also be statistically significant. The slope for city 4 is 0.06878 - 0.14021 = -0.07143, which will almost certainly be statistically significant in the opposite direction (you need to fit my model to get the exact p-value). The trends for cities 4 and 6 are significantly different (P = 0.00139), so pooling them would be questionable.

It appears that your cities fall into three groups. If I were you, I would examine the characteristics of the cities, especially their geographical location, to understand why the trends might appear different.

The easiest way to estimate an overall trend averaged across all cities is to fit the simpler model

m1 <- glm.nb(casos ~ 0 + City + ano + offset(log(populacao)))

This model allows a different baseline for each city but assumes a common trend. Almost certainly, the overall trend will not be significant because the conflicting results for the individual cities will cancel out. Concluding no trend may be scientifically questionable however given the significant trends for individual cities.

Related Question