Solved – Implication of grouped and ungrouped data for Poisson Regression

generalized linear model, poisson-regression, regression

Poisson regression can be fitted to either grouped or ungrouped data. I expected some differences between the two approaches, so I compared them on a simulated data set. The estimated parameters are the same for both, but the residual deviances are very different.

This brings me to the question of whether any assumption needs to be satisfied before we can group our data.

# Rcode for simulated data #
rm(list=ls())
set.seed(1)
##############################################################
# Creating Random Age, Gender, obs count and population      #
##############################################################
nsim = 10000
age = sample(20:70, size = nsim, replace = TRUE)
Gender = sample(c("M","F"), size = nsim, replace = TRUE)
obs.count = sample(c(0,0,1), size = nsim, replace = TRUE)
population = sample(c(0.7,0.8,0.9,1), size = nsim, replace = TRUE)
ungrouped.data = data.frame(age,Gender,obs.count,population)
grouped.data = aggregate(cbind(ungrouped.data$obs.count,ungrouped.data$population),list(ungrouped.data$age,ungrouped.data$Gender), FUN = "sum")
names(grouped.data) = c("age", "Gender", "obs.count", "population")

############################################
# GLM model for group and ungroup data set #
############################################
model.group = glm(obs.count ~ age + Gender + offset(log(population)), family = poisson, data = grouped.data)
summary(model.group)
model.ungroup = glm(obs.count ~ age + Gender + offset(log(population)), family = poisson, data = ungrouped.data)
summary(model.ungroup)

Best Answer

Since the sums of counts by combination of factors in the model, together with the anti-logged offsets, are the sufficient statistics for the Poisson model, the two analyses should give the same coefficient estimates and standard errors. Differences in other output, such as the residual deviance, reflect how the software handles the two data layouts rather than a difference in the model.
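The sufficiency argument can be written out as a short sketch. The notation here is an assumption introduced for illustration, not taken from the question: $y_i$ is an individual count, $t_i$ its exposure (the `population` offset), $x_g$ the covariate vector shared by every observation in group $g$, $Y_g = \sum_{i \in g} y_i$ and $T_g = \sum_{i \in g} t_i$.

```latex
\begin{align*}
\ell_{\text{ungrouped}}(\beta)
  &= \sum_g \sum_{i \in g}
     \left[ y_i \log t_i + y_i\, x_g^\top \beta
            - t_i\, e^{x_g^\top \beta} - \log y_i! \right] \\
  &= \sum_g \left[ Y_g\, x_g^\top \beta - T_g\, e^{x_g^\top \beta} \right] + C_1, \\
\ell_{\text{grouped}}(\beta)
  &= \sum_g \left[ Y_g \log T_g + Y_g\, x_g^\top \beta
            - T_g\, e^{x_g^\top \beta} - \log Y_g! \right] \\
  &= \sum_g \left[ Y_g\, x_g^\top \beta - T_g\, e^{x_g^\top \beta} \right] + C_2,
\end{align*}
% C_1 and C_2 collect the terms that do not depend on beta.
```

The two log-likelihoods differ only by a constant, so they are maximized by the same $\beta$ and have the same curvature. The step from the first line to the second relies on every covariate in the model being constant within a group, which is the condition that has to hold before the data are grouped.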

In this case, the differing residual deviances arise because the R glm function treats each row of the data as one observation, so the residual degrees of freedom, and the saturated model against which the deviance is measured, differ between the grouped and ungrouped fits. This can be a problem with some software when you use sufficient statistics instead of individual observations. For example, PROC NLMIXED in SAS has the DF option in the PROC NLMIXED statement to deal with this type of problem. I am not sure whether glm has an equivalent option.
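One way to check this on the simulated data from the question is to compare the two fits directly. The sketch below rebuilds the data with the same settings. The helper `fit` and the object names (`full.u`, `null.g`, and so on) are made up for illustration. It compares coefficients, standard errors, and the deviance drop from an intercept-only model to the full model.

```r
set.seed(1)
nsim <- 10000

# Ungrouped data: one row per individual, as in the question
ungrouped <- data.frame(
  age        = sample(20:70, nsim, replace = TRUE),
  Gender     = sample(c("M", "F"), nsim, replace = TRUE),
  obs.count  = sample(c(0, 0, 1), nsim, replace = TRUE),
  population = sample(c(0.7, 0.8, 0.9, 1), nsim, replace = TRUE)
)

# Grouped data: counts and exposure summed within each age x Gender cell
grouped <- aggregate(cbind(obs.count, population) ~ age + Gender,
                     data = ungrouped, FUN = sum)

# Hypothetical helper: Poisson GLM with a log link
fit <- function(formula, data) glm(formula, family = poisson, data = data)

full.u <- fit(obs.count ~ age + Gender + offset(log(population)), ungrouped)
full.g <- fit(obs.count ~ age + Gender + offset(log(population)), grouped)
null.u <- fit(obs.count ~ 1 + offset(log(population)), ungrouped)
null.g <- fit(obs.count ~ 1 + offset(log(population)), grouped)

# Largest absolute difference in coefficients and in standard errors
coef.diff <- max(abs(coef(full.u) - coef(full.g)))
se.diff   <- max(abs(sqrt(diag(vcov(full.u))) - sqrt(diag(vcov(full.g)))))

# Deviance drop from the null to the full model in each layout
lr.u <- deviance(null.u) - deviance(full.u)
lr.g <- deviance(null.g) - deviance(full.g)

print(c(coef.diff = coef.diff, se.diff = se.diff))
print(c(lr.ungrouped = lr.u, lr.grouped = lr.g))
print(c(df.ungrouped = df.residual(full.u), df.grouped = df.residual(full.g)))
```

The coefficient and standard-error differences should be zero up to convergence tolerance, and the two deviance drops should agree, even though the residual deviances and residual degrees of freedom do not. Comparisons between nested models are therefore unaffected by grouping. Only the goodness-of-fit reading of the residual deviance depends on the layout.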
