Solved – GLM with grouped/aggregated data in R

aggregation, generalized linear model, r

I would like to fit a GLM for the rate underlying a Poisson process, using data with variable exposure (the period of measurement). The question is whether to aggregate/group the data before fitting. The model is

$$
\mu_i = \exp(\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i})
$$
$$
Y_i \sim \mathrm{Poisson}(\mu_i t_i)
$$

Here $\mu_i$ is the rate, $t_i$ is the exposure (the time over which observation $i$ was recorded), and $Y_i$ is the Poisson-distributed count measured during $t_i$. The exposure $t_i$ should therefore, by my understanding, appear both as an offset in the GLM and as a weight (longer observations get more weight). I code this up in R as follows:

#generate the data:
numsamples <- 50000
x1 <- sample(1:20, numsamples, replace = TRUE)
x2 <- sample(1:10, numsamples, replace = TRUE)
t <- 1/sample(1:10, numsamples, replace = TRUE)  #exposure time

mu_rate <- exp(0.1 + 0.04*x1 + 0.025*x2)  #log-linear rate
#generate the count data:
y <- rpois(numsamples, mu_rate*t)
#combine the data into a data frame
df <- data.frame(y = y, x1 = x1, x2 = x2, t = t)

#fit a glm:
glm1 <- glm(y ~ x1 + x2, data = df, family = poisson(link = "log"),
    offset = log(df$t), weights = df$t/max(df$t))


#aggregate data with identical covariate values - sum both y and t
df_agg <- aggregate(cbind(y, t) ~ x1 + x2, data = df, FUN = sum)

#fit a glm to the aggregated data
glm2 <- glm(y ~ x1 + x2, data = df_agg, family = poisson(link = "log"),
    offset = log(df_agg$t), weights = df_agg$t/max(df_agg$t))

Here I have fit to the raw data, and also to the data aggregated over identical values of $X_1$ and $X_2$, summing both the count $Y$ and the exposure $t$.
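One way to put the two sets of estimates side by side (using the fits stored as glm1 and glm2 above):

#compare the coefficients of the raw-data and aggregated fits
cbind(raw = coef(glm1), aggregated = coef(glm2))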

Now, my understanding is that aggregating the data should make no difference to the fit (provided one offsets and weights appropriately with the newly aggregated exposure $t$); see, for example, page 10 of http://data.princeton.edu/wws509/notes/c4.pdf.
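As I read it, the argument is essentially the following sketch (for the offset-only model, i.e. with no prior weights): the Poisson log-likelihood is

$$
\ell(\beta) = \sum_i \left[ Y_i(\eta_i + \log t_i) - t_i e^{\eta_i} \right] + \text{const}, \qquad \eta_i = \beta_0+\beta_1 X_{1i}+\beta_2 X_{2i},
$$

so observations sharing the same $(X_1, X_2)$ (and hence the same $\eta$) enter the likelihood only through $\sum_i Y_i$ and $\sum_i t_i$, which is exactly what the aggregation computes.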
However, the two glm fits give different coefficients (e.g. 0.1051 vs 0.1065 for the intercept). Admittedly the difference here is smaller than the standard error of the coefficients; however, (a) I would have expected no difference at all beyond machine precision, and (b) on more complicated data sets, which I can't replicate here, the discrepancy is considerably larger than the standard error. Increasing the maximum number of iterations and decreasing the convergence tolerance (epsilon) had no effect.
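For reference, this is the kind of thing I mean by tightening the convergence settings (the specific values below are only illustrative; R's defaults are epsilon = 1e-8 and maxit = 25):

#refit with stricter convergence criteria - made no difference to the discrepancy
glm1_strict <- glm(y ~ x1 + x2, data = df, family = poisson(link = "log"),
    offset = log(df$t), weights = df$t/max(df$t),
    control = glm.control(epsilon = 1e-12, maxit = 100))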

So I guess my question boils down to: (a) is there something wrong with the way I have offset/weighted my data or done the aggregation? And (b) is there something wrong with my expectation of obtaining identical fit parameters from the aggregated data?

Thanks in advance.

Best Answer

In case anyone finds this: the answer is that I was double-counting the exposure. The exposure should appear only as an offset in the Poisson GLM, not also as a weight. With the weights argument removed, the original data and the aggregated data give consistent results.
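For completeness, a minimal sketch of the corrected fits on the simulated data above (offset only, no weights argument); both calls should now return essentially the same coefficients:

#corrected fit to the raw data: exposure enters only through the offset
glm_raw <- glm(y ~ x1 + x2, data = df, family = poisson(link = "log"),
    offset = log(df$t))

#corrected fit to the aggregated data
glm_agg <- glm(y ~ x1 + x2, data = df_agg, family = poisson(link = "log"),
    offset = log(df_agg$t))

#the two sets of coefficients now agree up to convergence tolerance
cbind(raw = coef(glm_raw), aggregated = coef(glm_agg))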
