Interpret GLM without intercept

biostatistics, generalized linear model, hypothesis testing, r, regression

I have a question about the output of my glm model WITHOUT an intercept. I am comparing the number of infected leaves on plants in different months. In the case of a model WITH an intercept (using the default log link of the Poisson model), the (Intercept) should represent the log of the mean number of infected leaves in the reference month. The regression coefficients for the non-reference months are the differences in the log of the mean counts of each month from the reference month. I don't want to include a reference group because the output doesn't make much sense. So I removed the intercept from the model using -1.

Here is the model

dat_lambsburg$month <- factor(dat_lambsburg$month, 
    levels = c("May", "November", "June", "July", "August", 
               "September", "October"))

mod_9 <-
  glm(total_count ~  month - 1, family = quasipoisson, 
       data = dat_lambsburg)

summary(mod_9)

Call:
glm(formula = total_count ~ month - 1, family = quasipoisson, 
    data = dat_lambsburg)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-10.8743   -7.6599   -2.2361    0.8373   22.0828  

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
monthMay        -13.3026  4304.2345  -0.003  0.99755    
monthNovember   -13.3026  3043.5534  -0.004  0.99654    
monthJune         0.9163     2.0507   0.447  0.65802    
monthJuly         2.5649     0.8993   2.852  0.00755 ** 
monthAugust       4.5512     0.4711   9.661 5.23e-11 ***
monthSeptember    3.9195     0.4568   8.579 8.36e-10 ***
monthOctober      4.0797     0.4217   9.675 5.06e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 84.10988)

    Null deviance: 10704.9  on 39  degrees of freedom
Residual deviance:  2346.4  on 32  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 11

My question is how to interpret a model without an intercept/reference group. The results overall make sense: there is significantly more disease from July to October. But what do the estimates for May (-13.3026) and November (-13.3026) refer to in my case? If the estimate represents a count, how can a count be negative? To provide some context, no infected leaves were recorded in May and November, and the highest counts were recorded in October. I have attached a figure of the raw data.

Details about the experiment: I collected positive count data, specifically the number of infected leaves per plant, as part of my experiment. To treat the plants, I placed them in the field for a week using four different treatments. Afterward, I brought them back to a controlled environment, counted the number of infected leaves, and DISCARDED the plants. I repeated this process with fresh plants the following week. This is not time series data. Treatments were applied for a week at both locations, so the duration of each treatment was the same. Plot size, treatment duration and sample material were identical.

Best Answer

I'm not surprised your output didn't make much sense with the intercept included. By default, glm uses the first level of the factor as the base case. That is the alphabetically first category unless you set the levels yourself, and your factor() call puts May first. May has essentially no counts and a correspondingly large standard error (see more about this below). That uncertainty will infect all of the contrast coefficients. Basically, for each month you are trying to compare its mean rate to the mean rate of May, which you don't know very well. If you wanted to use an intercept in your model, you should force the model to use the month with the largest count (October, it looks like) as the base case; then you will get reasonable coefficients.
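As a minimal sketch of how the base case is chosen and how it could be changed, the snippet below uses base R's factor() and relevel(). The month values are invented for illustration; only the month names come from the question.

```r
# Hypothetical month labels, invented for illustration
months_chr <- c("May", "June", "October", "May", "October")

# With no explicit levels, factor() sorts them alphabetically,
# so "June" would be the base case here
default_levels <- levels(factor(months_chr))

# Explicit levels, as in the question: the first one ("May") is the base case
month <- factor(months_chr, levels = c("May", "June", "October"))

# relevel() moves the chosen level to the front, making it the base case
month_ref_oct <- relevel(month, ref = "October")
levels(month_ref_oct)  # "October" "May" "June"
```

Applied to the question's data, the same relevel() call on dat_lambsburg$month before fitting total_count ~ month should make every coefficient a contrast against October.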

Without the intercept, your coefficients represent a fixed effect for each month. Basically, the model is giving you an estimate of the mean rate for each month, transformed by the model's link function. You didn't specify a link function explicitly, so the model used the default, which for the quasipoisson family is log. So the coefficients are the logs of the monthly mean rates, and exponentiating them gives the mean rates themselves. Another way to get the monthly mean counts is to use the model's predict method with type = 'response', which will return the values directly.
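To make that concrete, here is a small sketch on invented data (the toy data frame and its counts are assumptions, not the asker's data). It shows that exp(coef), predict(..., type = "response") and the raw per-month means should all agree for a no-intercept model with a single factor.

```r
# Toy data, invented for illustration: three months, three plants each
toy <- data.frame(
  month = factor(rep(c("June", "July", "August"), each = 3),
                 levels = c("June", "July", "August")),
  total_count = c(1, 3, 2,  10, 16, 13,  80, 100, 90)
)

# Same model form as in the question: no intercept, default log link
fit <- glm(total_count ~ month - 1, family = quasipoisson, data = toy)

# Route 1: back-transform the coefficients from the log scale
rates_from_coef <- exp(coef(fit))

# Route 2: ask predict() for values on the response scale
new_months <- data.frame(
  month = factor(levels(toy$month), levels = levels(toy$month))
)
rates_from_predict <- predict(fit, newdata = new_months, type = "response")

# Both should match the plain per-month means (2, 13 and 90 here)
raw_means <- tapply(toy$total_count, toy$month, mean)
```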

Finally, for your last question: the coefficients are logs of mean counts, not counts, so they can be negative. The large negative coefficients for May and November indicate that the estimate of the mean rate for those months is much less than 1 (exp(-13.3026) is roughly 1.7e-6, which is effectively zero). As you say, you had no counts at all in those months, which also looks to be the case in your graph. When you try to compute an average rate without any counts, all the model can really say is that the rate must be tiny. Formally it gives you a number, but the number isn't very meaningful, which is what the large standard errors for those terms are telling you.
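The behaviour can be reproduced on a toy example. The sketch below uses invented counts in which one month has only zeros. The fitted coefficient for that month should come out as a large negative number with a very large standard error. As far as I can tell, its exact value reflects where the fitting iterations happened to stop rather than anything in the data, since the log of a zero mean is minus infinity.

```r
# Toy data, invented for illustration: "May" has no infected leaves at all
toy <- data.frame(
  month = factor(rep(c("May", "July", "October"), each = 3),
                 levels = c("May", "July", "October")),
  total_count = c(0, 0, 0,  10, 16, 13,  50, 70, 60)
)

fit <- glm(total_count ~ month - 1, family = quasipoisson, data = toy)

est <- coef(fit)                                  # log mean counts
se  <- summary(fit)$coefficients[, "Std. Error"]  # their standard errors

# The all-zero month: large negative estimate, back-transformed mean near 0,
# and a standard error that swamps the estimate
est["monthMay"]
exp(est["monthMay"])
se["monthMay"]

# Months with data behave normally: exp(estimate) is the sample mean
exp(est["monthJuly"])   # about 13
```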

Related Solutions

Solved – Count Data – Gaussian, poisson or quasipoisson

The answer is likely to be quasipoisson.

This will depend a bit on how much data you have. Is it only slightly more than the number of parameters (12)? Assuming you have at least, say, 24 counts:

When you model data with a Poisson distribution, you are saying that the variance of that data is equal to its mean. In other words, if you predict a count of 10000, then the variance of that count is 10000 (standard deviation 100).

In real life, that isn't always true. Some data have more variance than this, and some less. It looks like your data has less (if we predict a count of 10000, then the variance appears to be more like 1371 rather than 10000).

Your (non-quasi-)Poisson model ignores that fact. It takes the predictive variance to always be equal to the predictive mean, even when the data indicate otherwise. This is why it thinks the parameters are insignificant: it is greatly overstating the predictive variance.
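A rough sketch of what the quasipoisson family changes, on invented data (the x and y values are assumptions for illustration only). The point estimates are the same as for the Poisson fit. What differs is that the standard errors are multiplied by the square root of an estimated dispersion, which is the Pearson chi-square divided by the residual degrees of freedom. A dispersion below 1 shrinks the standard errors, a dispersion above 1 inflates them.

```r
# Toy data, invented for illustration
x <- 1:10
y <- c(2, 9, 1, 14, 3, 25, 7, 30, 12, 41)

pois_fit  <- glm(y ~ x, family = poisson)
quasi_fit <- glm(y ~ x, family = quasipoisson)

# Dispersion estimate: Pearson chi-square / residual degrees of freedom
phi <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)

se_pois  <- summary(pois_fit)$coefficients[, "Std. Error"]
se_quasi <- summary(quasi_fit)$coefficients[, "Std. Error"]

# Same coefficients, standard errors rescaled by sqrt(phi)
cbind(se_pois, se_quasi, ratio = se_quasi / se_pois, sqrt_phi = sqrt(phi))
```

With the figures quoted above (variance around 1371 where Poisson assumes 10000), the implied dispersion would be about 0.137, so standard errors would shrink by a factor of roughly sqrt(0.137).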

If you only have 13-15 rows of data, then it might just be that the Poisson GLM happens to fit very well and the residuals were smaller than expected.

If the counts are reasonably large, the Gaussian distribution is a good approximation. If some counts are quite small (say, less than 25), then it works less well. Bear in mind also that if you use a Gaussian LM, the effects are additive (observing in November = +1000 birds against June, for example) rather than multiplicative (observing in November = x2 birds against June).
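The additive versus multiplicative distinction can be seen on a tiny invented example (the bird counts below are assumptions chosen so that November has 1000 more birds, and twice as many, as June). The Gaussian model reports the difference in means, while the log-link model reports the log of their ratio.

```r
# Toy data, invented for illustration: mean 1000 in June, 2000 in November
birds <- data.frame(
  month = factor(rep(c("June", "November"), each = 2),
                 levels = c("June", "November")),
  count = c(900, 1100, 1900, 2100)
)

gauss_fit <- lm(count ~ month, data = birds)
pois_fit  <- glm(count ~ month, family = quasipoisson, data = birds)

# Gaussian LM: the coefficient is a difference in means (+1000 birds)
additive_effect <- coef(gauss_fit)["monthNovember"]

# Log-link GLM: exp(coefficient) is a ratio of means (x2 birds)
multiplicative_effect <- exp(coef(pois_fit)["monthNovember"])
```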

GLM without intercept

You are presumably trying to fit an intercept and a trend for each city. You have six cities, so that's six intercepts and six trends, twelve coefficients in all. But you have attempted to fit a linear model with 13 coefficients, so one of the coefficients must be redundant.

The redundancy is not caused by removing the intercept but rather by trying to force R to estimate 6 interaction terms with your six one-hot coded dummy variables when there are actually only 5 degrees of freedom available for interaction.

Your problem would be easier if you created a factor indicating the city (let's call it City) instead of doing your own coding of dummy variables. The model you presumably want is:

library(MASS)  # glm.nb() is provided by the MASS package
m1 <- glm.nb(casos ~ 0 + City + City:ano + offset(log(populacao)))

which will estimate a separate intercept and slope for each city. An added advantage of this model is that the coefficients for the different cities are statistically independent, so forming confidence intervals for the slopes or for pairwise comparisons between them is easy.

Note the use of : instead of *, which tells R to include only the simple nested interaction without trying to expand out a main effect for ano, which is what has caused all your problems.
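Here is a small sketch of the difference between the two codings, on invented data for three hypothetical cities. To keep it in base R it uses a Poisson glm as a stand-in for glm.nb and leaves out the population offset. Those are simplifications of mine, not part of the model above. Both fits describe the same model. The nested form just reports each city's slope directly, whereas the crossed form reports a reference slope plus differences from it.

```r
# Toy data, invented for illustration: 3 hypothetical cities over 6 years
toy <- data.frame(
  City  = factor(rep(c("A", "B", "C"), each = 6)),
  ano   = rep(0:5, times = 3),
  casos = c(12, 15, 14, 19, 22, 25,   # rising
            30, 28, 31, 29, 30, 32,   # roughly flat
            40, 35, 33, 27, 24, 20)   # falling
)

# Nested form: one intercept and one slope per city (6 coefficients)
nested <- glm(casos ~ 0 + City + City:ano, family = poisson, data = toy)

# Crossed form: slope for reference city A, plus differences for B and C
crossed <- glm(casos ~ City * ano, family = poisson, data = toy)

slopes_nested <- coef(nested)[c("CityA:ano", "CityB:ano", "CityC:ano")]
slopes_crossed <- coef(crossed)["ano"] +
  c(0, coef(crossed)[c("CityB:ano", "CityC:ano")])

# The per-city slopes agree; only the parametrization differs
rbind(nested = unname(slopes_nested), crossed = unname(slopes_crossed))
```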

I can already tell from the model summary posted in your question what the trend will be for each city. The trend will be positive (upward) for cities 5 and 6, there's no trend for city 3, and the trend is negative (downward) for cities 1, 2 and 4. The slope for city 6 is 0.06878 (P = 0.029). The slope for city 5 is 0.06878 + 0.01989 = 0.08867, which will presumably also be statistically significant. The slope for city 4 is 0.06878 - 0.14021 = -0.07143, which will almost certainly be statistically significant in the opposite direction (you need to fit my model to get the exact p-value). The trends for cities 4 and 6 are significantly different (P = 0.00139), so pooling them would be questionable.
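The slope arithmetic above can be written out as follows. The three input numbers are the ones quoted from the model summary. The conversion to a yearly percentage change assumes the default log link of glm.nb and is my own addition for interpretation.

```r
# Coefficients quoted from the posted summary (city 6 is the reference)
slope_city6  <- 0.06878
diff_city5   <- 0.01989    # interaction term: city 5 minus city 6
diff_city4   <- -0.14021   # interaction term: city 4 minus city 6

# Per-city slopes on the log scale
slope_city5 <- slope_city6 + diff_city5   # 0.08867
slope_city4 <- slope_city6 + diff_city4   # -0.07143

# Under a log link, exp(slope) is the multiplicative change per year
pct_change_per_year <- 100 * (exp(c(city4 = slope_city4,
                                    city5 = slope_city5,
                                    city6 = slope_city6)) - 1)
pct_change_per_year
```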

It appears that your cities fall into three groups. If I were you, I would examine the characteristics of the cities, especially their geographical location, to understand why the trends might appear different.

The easiest way to estimate an overall trend averaged across all cities is to fit the simpler model

m2 <- glm.nb(casos ~ 0 + City + ano + offset(log(populacao)))

This model allows a different baseline for each city but assumes a common trend. Almost certainly, the overall trend will not be significant because the conflicting results for the individual cities will cancel out. Concluding that there is no trend may be scientifically questionable, however, given the significant trends for individual cities.
