Solved – GLM with grouped/aggregated data in R

aggregation, generalized linear model, r

I would like to fit a GLM for the rate underlying a Poisson process, using data with variable exposure (the period of measurement). The question is whether to aggregate/group the data before fitting. The model is

$$
\mu_i = \exp(\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i})
$$
$$
Y_i \sim \mathrm{Poisson}(\mu_i t_i)
$$

Here $\mu_i$ is the rate, $t_i$ is the exposure (the time over which observation $i$ was recorded), and $Y_i$ is the Poisson-distributed count measured during $t_i$. The exposure $t_i$ should therefore, by my understanding, appear both as an offset in the GLM and as a weight (longer observations get more weight). I code this up in R as follows:

#generate the data:
numsamples <- 50000
x1 <- sample(1:20, numsamples, replace = TRUE)
x2 <- sample(1:10, numsamples, replace = TRUE)
t <- 1/sample(1:10, numsamples, replace = TRUE)  #exposure time

mu_rate <- exp(0.1 + 0.04*x1 + 0.025*x2)  #log-linear rate
#generate the count data:
y <- rpois(numsamples, mu_rate*t)
#combine the data into a data frame
df <- data.frame(y = y, x1 = x1, x2 = x2, t = t)

#fit a glm:
glm1 <- glm(y ~ x1 + x2, data = df, family = poisson(link = "log"),
    offset = log(df$t), weights = df$t/max(df$t))


#aggregate data with identical covariate values - sum both y and t
df_agg <- aggregate(cbind(y, t) ~ x1 + x2, data = df, FUN = sum)

#fit a glm to the aggregated data
glm2 <- glm(y ~ x1 + x2, data = df_agg, family = poisson(link = "log"),
    offset = log(df_agg$t), weights = df_agg$t/max(df_agg$t))

Here I have fit to the raw data, and also to the data aggregated over identical values of $X_1$ and $X_2$, summing both the count $Y$ and the exposure $t$.
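One way to put the two sets of estimates side by side (using the fits stored as glm1 and glm2 above):

#compare the coefficients of the raw-data and aggregated fits
cbind(raw = coef(glm1), aggregated = coef(glm2))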

Now, my understanding is that aggregating the data should make no difference to the fit (provided one offsets and weights appropriately with the newly aggregated exposure $t$); see, for example, page 10 of http://data.princeton.edu/wws509/notes/c4.pdf.
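As I read it, the argument is essentially the following sketch (for the offset-only model, i.e. with no prior weights): the Poisson log-likelihood is

$$
\ell(\beta) = \sum_i \left[ Y_i(\eta_i + \log t_i) - t_i e^{\eta_i} \right] + \text{const}, \qquad \eta_i = \beta_0+\beta_1 X_{1i}+\beta_2 X_{2i},
$$

so observations sharing the same $(X_1, X_2)$ (and hence the same $\eta$) enter the likelihood only through $\sum_i Y_i$ and $\sum_i t_i$, which is exactly what the aggregation computes.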
However, the two glm fits give different coefficients (e.g. 0.1051 vs 0.1065 for the intercept). Admittedly the difference here is smaller than the standard error of the coefficients; however, (a) I would have expected no difference at all beyond machine precision, and (b) on more complicated data sets, which I can't replicate here, the discrepancy is considerably larger than the standard error. Increasing the maximum number of iterations and decreasing the convergence tolerance (epsilon) had no effect.
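For reference, this is the kind of thing I mean by tightening the convergence settings (the specific values below are only illustrative; R's defaults are epsilon = 1e-8 and maxit = 25):

#refit with stricter convergence criteria - made no difference to the discrepancy
glm1_strict <- glm(y ~ x1 + x2, data = df, family = poisson(link = "log"),
    offset = log(df$t), weights = df$t/max(df$t),
    control = glm.control(epsilon = 1e-12, maxit = 100))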

So I guess my question boils down to: (a) is there something wrong with the way I have offset/weighted my data or done the aggregation? And (b) is there something wrong with my expectation of obtaining identical fit parameters from the aggregated data?

Thanks in advance.

Best Answer

In case anyone finds this: the answer is that I was double-counting the exposure. The exposure should appear only as an offset in the Poisson GLM, not also as a weight. With the weights argument removed, the original data and the aggregated data give consistent results.
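For completeness, a minimal sketch of the corrected fits on the simulated data above (offset only, no weights argument); both calls should now return essentially the same coefficients:

#corrected fit to the raw data: exposure enters only through the offset
glm_raw <- glm(y ~ x1 + x2, data = df, family = poisson(link = "log"),
    offset = log(df$t))

#corrected fit to the aggregated data
glm_agg <- glm(y ~ x1 + x2, data = df_agg, family = poisson(link = "log"),
    offset = log(df_agg$t))

#the two sets of coefficients now agree up to convergence tolerance
cbind(raw = coef(glm_raw), aggregated = coef(glm_agg))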
