Solved – Implication of grouped and ungrouped data for Poisson Regression

generalized linear model, poisson-regression, regression

Poisson regression can be fitted to either grouped or ungrouped data, and I expected the two approaches to differ in some way. To check, I compared them on a simulated data set. The estimated parameters are the same for both approaches, but the residual deviances are very different.

This brings me to my question: are there any assumptions that need to be satisfied before we can group our data?

# R code for simulated data #
rm(list=ls())
set.seed(1)
##############################################################
# Creating Random Age, Gender, obs count and population      #
##############################################################
nsim = 10000
age = sample(20:70, size = nsim, replace = TRUE)
Gender = sample(c("M", "F"), size = nsim, replace = TRUE)
obs.count = sample(c(0, 0, 1), size = nsim, replace = TRUE)
population = sample(c(0.7, 0.8, 0.9, 1), size = nsim, replace = TRUE)
ungrouped.data = data.frame(age, Gender, obs.count, population)
# Sum the counts and the exposure within each age-by-Gender cell;
# the formula interface keeps the original column names
grouped.data = aggregate(cbind(obs.count, population) ~ age + Gender,
                         data = ungrouped.data, FUN = sum)

############################################
# GLM for grouped and ungrouped data sets  #
############################################
model.group = glm(obs.count ~ age + Gender + offset(log(population)),
                  family = poisson, data = grouped.data)
summary(model.group)
model.ungroup = glm(obs.count ~ age + Gender + offset(log(population)),
                    family = poisson, data = ungrouped.data)
summary(model.ungroup)

Best Answer

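Since the sums of counts by combination of the factors in the model, together with the anti-logged offsets, are the sufficient statistics for a Poisson distribution, there should be no difference in the fitted coefficients or their standard errors between the two analyses. Differences in other output, such as the residual deviance and its degrees of freedom, arise from how the software counts observations rather than from the model itself.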

In this case, the problem is that the R glm function does not know what degrees of freedom to use. This can be a problem with some software, when you use sufficient statistics instead of individual observations. For example, PROC NLMIXED in SAS has the DF option in the PROC NLMIXED statement to deal with this type of problem. I am not sure what the equivalent option in glm is, but I assume it exists.
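A minimal sketch of how the two fits can be compared, reusing the simulation from the question. The helper `fit_pois` and the object names are made up for illustration, and the tighter `epsilon` is only there so that the numerical comparison is clean. It checks three things: the coefficients agree, the residual deviance and its degrees of freedom do not (glm treats every row as one observation), and the drop in deviance between nested models is the same for both data layouts.

```r
set.seed(1)
nsim <- 10000
ungrouped <- data.frame(
  age        = sample(20:70, nsim, replace = TRUE),
  Gender     = sample(c("M", "F"), nsim, replace = TRUE),
  obs.count  = sample(c(0, 0, 1), nsim, replace = TRUE),
  population = sample(c(0.7, 0.8, 0.9, 1), nsim, replace = TRUE)
)
# One row per age-by-Gender cell, with summed counts and summed exposure
grouped <- aggregate(cbind(obs.count, population) ~ age + Gender,
                     data = ungrouped, FUN = sum)

# Hypothetical helper: Poisson GLM with a tight convergence tolerance
fit_pois <- function(formula, data) {
  glm(formula, family = poisson, data = data,
      control = glm.control(epsilon = 1e-10))
}

full.g <- fit_pois(obs.count ~ age + Gender + offset(log(population)), grouped)
full.u <- fit_pois(obs.count ~ age + Gender + offset(log(population)), ungrouped)
null.g <- fit_pois(obs.count ~ 1 + offset(log(population)), grouped)
null.u <- fit_pois(obs.count ~ 1 + offset(log(population)), ungrouped)

# Same coefficient estimates from both layouts
coef.match <- all.equal(coef(full.g), coef(full.u), tolerance = 1e-6)

# Residual deviance and residual df depend on the number of rows
resid.summary <- data.frame(
  layout   = c("grouped", "ungrouped"),
  deviance = c(deviance(full.g), deviance(full.u)),
  df       = c(df.residual(full.g), df.residual(full.u))
)

# The deviance drop between nested models (the likelihood ratio statistic)
# does not depend on the layout
drop.g <- deviance(null.g) - deviance(full.g)
drop.u <- deviance(null.u) - deviance(full.u)

print(coef.match)
print(resid.summary)
print(c(grouped = drop.g, ungrouped = drop.u))
```

So comparisons between nested models give the same answer either way, while the residual deviance on its own is measured against a different saturated model in each layout.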

Related Solutions

Solved – Low sample size: LR vs F – test

The likelihood ratio test you're using relies on a chi-square distribution to approximate the null distribution of the test statistic. This approximation works best with large sample sizes, so its inaccuracy with a small sample makes some sense.

I see a few options for getting better Type-I error in your situation:

There are corrected versions of the likelihood ratio test, such as Bartlett's correction. I don't know much about these (beyond the fact that they exist), but I've heard that Ben Bolker knows more.
You could estimate the null distribution of the likelihood ratio statistic by bootstrapping. If the observed statistic exceeds the 95th percentile of the bootstrap distribution, it is statistically significant at the 5% level.

Finally, the Poisson distribution has one fewer free parameter than the negative binomial, and might be worth trying when the sample size is very small.
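The bootstrap option above can be sketched as a parametric bootstrap. This is an illustration only: the data, the two nested Poisson models, and the names `lr_stat`, `p_boot` and `p_chisq` are all hypothetical, since the original question's models are not shown here. The idea is to simulate new responses from the fitted null model, recompute the likelihood ratio statistic each time, and compare the observed statistic with that simulated distribution.

```r
set.seed(42)

# Hypothetical small data set in which the null model (no effect of x) is true
n <- 15
d <- data.frame(x = rnorm(n))
d$y <- rpois(n, lambda = exp(0.5))

# Likelihood ratio statistic for y ~ 1 versus y ~ x (Poisson GLMs)
lr_stat <- function(data) {
  fit0 <- glm(y ~ 1, family = poisson, data = data)
  fit1 <- glm(y ~ x, family = poisson, data = data)
  deviance(fit0) - deviance(fit1)
}

obs  <- lr_stat(d)
fit0 <- glm(y ~ 1, family = poisson, data = d)

# Simulate responses from the fitted null model and recompute the statistic
B <- 999
boot <- replicate(B, {
  d_star <- d
  d_star$y <- rpois(n, lambda = fitted(fit0))
  lr_stat(d_star)
})

# Bootstrap p-value (upper tail) next to the chi-square approximation
p_boot  <- (1 + sum(boot >= obs)) / (B + 1)
p_chisq <- pchisq(obs, df = 1, lower.tail = FALSE)
print(c(bootstrap = p_boot, chi_square = p_chisq))
```

With a sample this small the two p-values need not agree, and the gap between them gives a rough sense of how far the chi-square approximation is off.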

Solved – Modelling mortality rates using Poisson regression

Without seeing the dataset (not available), it seems mostly correct. The nice thing about Poisson regression is that it provides rates when used as suggested. One thing worth keeping in mind is that there may be overdispersion, in which case you should switch to negative binomial regression (see the MASS package).
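A rough sketch of the overdispersion check, on simulated data rather than the asker's data. The variable names, the true parameter values, and the use of the Pearson statistic divided by the residual degrees of freedom as the diagnostic are my assumptions; values well above 1 suggest overdispersion.

```r
library(MASS)  # provides glm.nb()

set.seed(2)
n <- 500
# Hypothetical data: counts with an exposure term, generated with extra-Poisson variation
d <- data.frame(x = rnorm(n), exposure = runif(n, 0.5, 2))
d$y <- rnbinom(n, mu = d$exposure * exp(0.3 + 0.5 * d$x), size = 1.5)

pois_fit <- glm(y ~ x + offset(log(exposure)), family = poisson, data = d)

# Pearson chi-square divided by the residual df; roughly 1 if the Poisson variance holds
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)

# Negative binomial alternative with the same linear predictor and offset
nb_fit <- glm.nb(y ~ x + offset(log(exposure)), data = d)

print(dispersion)
print(c(poisson = AIC(pois_fit), negbin = AIC(nb_fit)))
```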

Poisson regression doesn't care whether the data are aggregated or not, but in practice non-aggregated data are fragile and can cause unexpected errors. Note that you cannot have surv == 0 for any case, because the offset log(surv) would then be -Inf. When I tested it, the estimates were the same:

set.seed(1)
n <- 1500
# Simulated individual-level data: one row per patient
data <- 
  data.frame(
    dead = sample(0:1, n, replace = TRUE, prob = c(.9, .1)),
    # Follow-up in days; only 100 values are drawn and data.frame() recycles them to n rows
    surv = ceiling(exp(runif(100))*365),
    gender = sample(c("Male", "Female"), n, replace = TRUE),
    diagnosis = sample(0:1, n, replace = TRUE),
    age = sample(60:80, n, replace = TRUE),
    inclusion_year = sample(1998:2011, n, replace = TRUE)
  )

library(dplyr)
# Aggregated model: sum deaths and person-time within each covariate pattern.
# The offset converts days to units of 1,000 person-years.
model <- 
  data %>% 
  group_by(gender, 
           diagnosis,
           age,
           inclusion_year) %>% 
  summarise(Deaths = sum(dead),
            Person_time = sum(surv)) %>%
  glm(Deaths ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) + offset(log(Person_time/10^3/365.25)), 
      data = . , family = poisson)

# Same model fitted to the non-aggregated data
alt_model <- glm(dead ~ gender + diagnosis + I(age - 70) + I(inclusion_year - 1998) + offset(log(surv/10^3/365.25)), 
    data = data , family = poisson)
# Differences in coefficients and confidence limits are at rounding-error level
sum(coef(alt_model) - coef(model))
# > 1.779132e-14
sum(abs(confint(alt_model) - confint(model)))
# > 6.013114e-11

As you get a rate it is important to center the variables so that the intercept is interpretable, e.g.:

> exp(coef(model)["(Intercept)"])
(Intercept) 
    51.3771

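This can be interpreted as the base rate for the reference levels of the covariates, here per 1,000 person-years because the offset is log(Person_time/10^3/365.25) with survival in days. The exponentiated covariate coefficients are rate ratios. If we want the base rate after 10 years: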

> exp(coef(model)["(Intercept)"] + coef(model)["I(inclusion_year - 1998)"]*10)
(Intercept) 
     47.427

I've modeled the inclusion year as a linear trend here, but you should probably check for nonlinearities, and sometimes it is useful to categorize the time points. I used this approach in this article:

D. Gordon, P. Gillgren, S. Eloranta, H. Olsson, M. Gordon, J. Hansson, and K. E. Smedby, “Time trends in incidence of cutaneous melanoma by detailed anatomical location and patterns of ultraviolet radiation exposure: a retrospective population-based study,” Melanoma Res., vol. 25, no. 4, pp. 348–356, Aug. 2015.
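One way to check the linear trend for nonlinearity, as suggested above, is to compare it with more flexible codings of the year. The sketch below uses simulated data and is not the analysis from the article: the variable names, the natural spline with 3 degrees of freedom, and the year-as-factor model are my own choices for illustration.

```r
library(splines)  # provides ns()

set.seed(3)
n <- 2000
# Hypothetical data: deaths with person-years of follow-up and a log-linear year trend
d <- data.frame(
  year         = sample(1998:2011, n, replace = TRUE),
  person_years = runif(n, 0.5, 5)
)
d$deaths <- rpois(n, lambda = d$person_years * exp(-2 + 0.03 * (d$year - 1998)))

# Linear trend, natural cubic spline, and one category per year
linear_fit <- glm(deaths ~ I(year - 1998) + offset(log(person_years)),
                  family = poisson, data = d)
spline_fit <- glm(deaths ~ ns(year, df = 3) + offset(log(person_years)),
                  family = poisson, data = d)
factor_fit <- glm(deaths ~ factor(year) + offset(log(person_years)),
                  family = poisson, data = d)

# Likelihood ratio tests: a small p-value would point to departure from linearity
print(anova(linear_fit, spline_fit, test = "Chisq"))
print(anova(linear_fit, factor_fit, test = "Chisq"))
```

The linear trend is nested in both alternatives, so the deviance differences can be read as tests of nonlinearity; the factor coding is the most flexible but spends one parameter per year.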
