Solved – Regularized GLM with aggregated data

Tags: aggregation, generalized linear model, glmnet, h2o, regularization

I am fitting a Poisson GLM to model claim rates. Since I have 1.5m+ records, I have aggregated my data (to improve efficiency). My understanding is that using aggregated data with a Poisson GLM will not affect the coefficient estimates. Indeed, if I fit a GLM with no regularization (using the log of exposure as an offset), I can exactly recreate the same coefficient estimates from either the full dataset or the aggregated data.
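(For reference, the identity I am relying on: dropping terms that do not involve the coefficients, the individual-level Poisson log-likelihood for a group $g$ of $n_g$ records sharing the same linear predictor $\eta_g$ is

$$\sum_{i \in g}\left(y_i\,\eta_g - e^{\eta_g}\right) = Y_g\,\eta_g - n_g e^{\eta_g} = Y_g\left(\eta_g + \log n_g\right) - e^{\eta_g + \log n_g} - Y_g \log n_g,$$

where $Y_g = \sum_{i \in g} y_i$ is the aggregated count. The final term is constant in the coefficients, so maximising the aggregated Poisson likelihood with offset $\log n_g$ gives the same estimates.)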

However, as soon as I introduce a lambda penalty, the coefficients differ depending on whether I use the full individual data or the aggregated data. The larger the penalty, the larger the difference in the coefficients.

My question is: should I be able to get the same coefficient estimates using aggregated data with a lambda penalty as I do with individual data? And if not, is it 'ok' to use regularization with aggregated data?

For info, I'm using the GLM function from H2O.ai, in RStudio. Here is some reproducible code (for simplicity I've only simulated 50k records here):

library(simstudy)
set.seed(123)

# simulate a dataset with 2 variables - 'Male' and 'Target'
def = defData(varname = "Male", dist = "binary", formula = 0.4)
def = defData(def, varname = "Target",dist = "binary",formula = "ifelse(Male == 1, 0.3, 0.7)")

# generate the dataset
dt = genData(50000,def)

# variable to indicate that each row represents 1 exposure period
dt$n = 1

dt$Male = as.factor(dt$Male)
dt = as.data.frame(dt)

# create an aggregated version of data
dt2 = aggregate(x = dt[c('Target','n')],
                by = list(Male = dt$Male),
                FUN = sum)

# add an offset column to aggregated data - log of exposure
dt2$offset = log(dt2$n)

# initialise h2o
library(h2o)
h2o.init(nthreads = -1)
dt = as.h2o(dt)
dt2 = as.h2o(dt2)

# fit GLM to full data, no regularization
mod1 = h2o.glm(x = 'Male', y = 'Target',training_frame = dt,family = 'poisson',
               lambda = 0,alpha = 0,seed = 123)
round(mod1@model$coefficients,5)

#Intercept    Male.1 
#-0.35798  -0.83271 

# mod2 - use aggregated data with an offset
mod2 = h2o.glm(x = 'Male', y = 'Target',training_frame = dt2,family = 'poisson',
               lambda = 0,alpha = 0,offset_column = 'offset',seed = 123)
round(mod2@model$coefficients,5)

#Intercept    Male.1 
#-0.35798  -0.83271 

# now repeat, with regularization. First, full data
mod3 = h2o.glm(x = 'Male', y = 'Target',training_frame = dt,family = 'poisson',
               lambda = 0.01,alpha = 0,seed = 123)
round(mod3@model$coefficients,5)

#Intercept    Male.0    Male.1 
#-0.76286   0.39547  -0.39547 

# now on aggregated data
mod4 = h2o.glm(x = 'Male', y = 'Target',training_frame = dt2,family = 'poisson',
               lambda = 0.01,alpha = 0,offset_column = 'offset',seed = 123)
round(mod4@model$coefficients,5)

#Intercept    Male.0    Male.1 
#-0.77433   0.41635  -0.41635 

UPDATE:

I have been through the H2O GLM documentation and can't find anything obviously helpful in this case. I have also tried removing collinear columns in the models above, turning off standardization, and ensuring that the number of training iterations is the same for each version of the model, all with no luck in terms of reproducibility.

Here is a link to the h2o GLM documentation:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html

In addition, I have recreated the same issue using the R 'glmnet' package as well as h2o (code is in the update below).

(I have amended the question to reflect that I'm modelling rate data.)

UPDATE:

I have also recreated the issue using the glmnet R package; see below.

# using a different GLM package - glmnet
library(simstudy)
set.seed(123)

# simulate a dataset with 3 variables - 'Male', 'Occupation' and 'Target'
def = defData(varname = "Male", dist = "binary", formula = 0.4)
def = defData(def,varname = "Occupation", dist = "categorical", formula = "0.2;0.4;0.4")
def = defData(def, varname = "Target",dist = "binary",formula = "ifelse(Male == 1, 0.3, 0.7)")

# generate the dataset
dt = genData(50000,def)

# variable to indicate exposure for each record
n = runif(nrow(dt), 1,3)
dt$n = n

dt$Male = as.factor(dt$Male)
dt$Occupation = as.factor(dt$Occupation)
dt = as.data.frame(dt)

# create an aggregated version of data
dt2 = aggregate(x = dt[c('Target','n')],
                by = list(Male = dt$Male,
                          Occupation = dt$Occupation),
                FUN = sum)

library(glmnet)

# glmnet needs a numeric design matrix, so expand the factor columns
# with model.matrix rather than coercing them via as.matrix
x1 = model.matrix(~ Male + Occupation, data = dt)[, -1]
y1 = dt$Target

fit1 = glmnet(x = x1,y = y1,family = "poisson",alpha = 0, lambda = 0.1, offset = log(dt$n), standardize = F)
coef(fit1)

# now using aggregated data
# same for the aggregated data
x2 = model.matrix(~ Male + Occupation, data = dt2)[, -1]
y2 = dt2$Target

fit2 = glmnet(x = x2,y = y2,family = "poisson",alpha = 0, lambda = 0.1, offset = log(dt2$n), standardize = F)
coef(fit2)

UPDATE:

I have been doing a bit more digging into this, for anyone who's interested.

I used the 'lambda_search' option set to TRUE (in h2o.glm) to see how h2o would automatically search over lambda to find the best model fit, and then used h2o.getGLMFullRegularizationPath() to check the different values of lambda that h2o was testing. I noticed that the lambdas h2o selected for the aggregated model were always 25000 times the lambdas selected for the original model. Since my dataset has been reduced by a factor of 25000 (50000 records down to 2), it seems my lambda needs to change accordingly. Indeed, when I tested the two models again, I could recreate the regularized original model almost exactly using aggregated data, provided lambda is set to the original lambda * nrow(full data)/nrow(aggregated data).
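This is roughly the check I ran (a sketch; lambda_search and h2o.getGLMFullRegularizationPath() are both part of the h2o R API, and note the offset column has to be added to the full data as well):

# add the offset to the full data too (log(1) = 0 for every row, since n = 1)
dt$offset = log(dt$n)

# let h2o search over lambda itself, then inspect the path of lambdas it tried
mod_s = h2o.glm(x = 'Male', y = 'Target', training_frame = dt,
                family = 'poisson', alpha = 0, lambda_search = TRUE,
                offset_column = 'offset', seed = 123)
h2o.getGLMFullRegularizationPath(mod_s)$lambdas

For example, refitting the full-data model with a fixed lambda: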

mod3 = h2o.glm(x = 'Male', y = 'Target',training_frame = dt,family = 'poisson',
               lambda = 0.1,alpha = 0,seed = 123, offset_column = 'offset',standardize = F)
mod3@model$coefficients

# Gives result:
#Intercept     Male.0     Male.1 
#-1.3984380  0.2773347 -0.2773347 

Then, if I try to recreate this model using aggregated data and the rescaled lambda:

# now on aggregated data
lambda = 0.1 * nrow(dt)/nrow(dt2)
mod4 = h2o.glm(x = 'Male', y = 'Target',training_frame = dt2,family = 'poisson',
               lambda = lambda,alpha = 0,offset_column = 'offset',seed = 123,standardize = F)
mod4@model$coefficients

# gives result:
# Intercept     Male.0     Male.1 
# -1.3984384  0.2773349 -0.2773349 

I am not quite sure how the maths works here; my tentative reading is below, and any confirmation or correction would be much appreciated.
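My guess (based on the objective stated in the H2O GLM docs, so treat this as an assumption): h2o minimises the average negative log-likelihood plus the penalty,

$$\min_{\beta}\;\frac{1}{N}\,\text{NegLogLik}(\beta)\;+\;\lambda\left(\alpha\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2\right).$$

Aggregation leaves the total log-likelihood unchanged (up to a constant, as in the identity at the top of the question) but shrinks $N$ from 50000 to 2, so the likelihood term is scaled up by a factor of 25000 relative to the penalty. Multiplying $\lambda$ by $N_{\text{full}}/N_{\text{agg}}$ restores the original balance, which would explain the factor I observed.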

Best Answer

I looked at this, and have some comments. First, you use offsets to represent weights. These are not necessarily the same thing, so I changed to the weights_column argument; see Difference between: Offset and Weights? (my modified version of your code is given below). But there are other problems. Second, I included the argument early_stopping=FALSE just to be sure of comparability. Third, there is a standardize argument with default value TRUE; it should be set to FALSE. If you want to standardize, it must be done BEFORE aggregation, else the coefficients will have different meanings in the two cases. But still, I get different results; to me it seems like the standardize argument has no effect. That may be an error in the h2o package, which you should investigate and report. And maybe try some other package, like glmnet.
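To make the offset/weights distinction concrete, here is a minimal sketch in base R (plain glm rather than h2o; 'agg' is a hypothetical aggregated frame with group counts Y and exposures n): with an offset you model the count, with prior weights you model the rate.

# offset form: model the count Y; exposure enters through the offset
glm(Y ~ Male, family = poisson, data = agg, offset = log(n))

# weight form: model the rate Y/n; exposure enters as a prior weight
# (R warns about non-integer responses here, but the fit is valid)
glm(Y/n ~ Male, family = poisson, data = agg, weights = n)

For an unpenalized Poisson GLM the two give identical coefficients; under regularization the equivalence is no longer automatic, which is the heart of the question.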

My modified code:

library(simstudy)  # On CRAN 
set.seed(7*11*13)

# simulate a dataset with 2 variables - 'Male' and 'Target'
def <- defData(varname = "Male", dist = "binary", formula = 0.4)
def <- defData(def, varname = "Target",dist = "binary",formula = "ifelse(Male == 1, 0.3, 0.7)")

# generate the dataset
dt <- genData(50000,def)

# variable to indicate that each row is 1 life
dt$n = 1

dt$Male <- as.factor(dt$Male)
dt <- as.data.frame(dt)

# create an aggregated version of data
dt2 <- aggregate(x = dt[c('Target','n')],
                by = list(Male = dt$Male),
                FUN = sum)

# add a weight column to aggregated data - exposure
dt2$weight <- dt2$n

# initialise h2o
library(h2o)  # on CRAN
h2o.init(nthreads = -1)
dt <-  as.h2o(dt)
dt2 <- as.h2o(dt2)

# fit GLM to full data, no regularization
mod1 <- h2o.glm(x = 'Male', y = 'Target',training_frame = dt,family = 'poisson', lambda = 0,alpha = 0,seed = 7*11*13, early_stopping=FALSE, standardize=FALSE)
round(mod1@model$coefficients,5)

# mod2 - use aggregated data with a weight column
mod2 <- h2o.glm(x = 'Male', y = 'Target', training_frame = dt2, family = 'poisson', lambda = 0, alpha = 0, weights_column = 'weight', seed = 7*11*13, early_stopping=FALSE, standardize=FALSE)
round(mod2@model$coefficients,5)

# now repeat, with regularization. First, full data
mod3 <- h2o.glm(x = 'Male', y = 'Target',training_frame = dt,family = 'poisson',
               lambda = 0.01,alpha = 0,seed = 7*11*13, early_stopping=FALSE, standardize=FALSE)
round(mod3@model$coefficients,5)

#Intercept    Male.0    Male.1 
#-0.76286   0.39547  -0.39547 

# now on aggregated data
mod4 <- h2o.glm(x = 'Male', y = 'Target',training_frame = dt2,family = 'poisson', lambda = 0.01,alpha = 0, weights_column = 'weight', seed = 7*11*13, early_stopping=FALSE, standardize=FALSE)
round(mod4@model$coefficients,5)
Related Question