Solved – Binomial GLM and different sample sizes

Tags: binomial-distribution, logistic, overdispersion, r, regression

I have a data set which consists of binomial proportions, let's say the success rate of converting a customer depending on the advertisement, the customer age, and various other factors.

For some common combinations of covariates, I have a lot of data, and therefore the binomial proportion of successes has low variance. For rare combinations of covariates, however, I have very little data, and therefore the variance of the proportion is high.

The magnitude of differences is very large, for example I might have 1 million trials for some combinations of covariates, and only 50 for others. However, I want to include ALL data in my model and weight it appropriately to get the best model fit.

I've tried to use R to do binomial (logistic) regression using a generalized linear model:

lrfit <- glm(cbind(converted, not_converted) ~ advertisement + age, family = binomial)

This is a good start because it automatically weights the observations by the number of trials.

However, it's not good enough. Here's why: say some observations have 100,000 trials and others have 1,000,000. Weighting by the number of trials gives the latter group ten times the weight. That seems nonsensical, because both proportions are already estimated precisely enough to deserve roughly equal treatment in the model. Clearly you want to down-weight groups with only 10 or 100 trials, but as the number of trials grows large, the weight should stop increasing.

Since weighted least squares uses the reciprocal of the error variance as the weight, my idea would be to calculate the posterior variance of each proportion (using the Jeffreys prior), add some small constant to it (so the variance, and hence the weight, levels off beyond a certain number of trials), and then use the reciprocal of that sum as the weight.
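For concreteness, the capped-weight idea can be sketched in R as follows. The Jeffreys posterior for a binomial proportion with s successes and f failures is Beta(s + 1/2, f + 1/2); the capping constant c_cap, the toy data frame, and the column names are all hypothetical stand-ins, and quasibinomial is used only to suppress the non-integer-weights warning that family = binomial would produce:

```r
# Posterior variance of the proportion under the Jeffreys prior:
# Beta(s + 1/2, f + 1/2) has variance ab / ((a+b)^2 (a+b+1))
post_var <- function(s, f) {
  a <- s + 0.5
  b <- f + 0.5
  a * b / ((a + b)^2 * (a + b + 1))
}

# Toy data standing in for the real conversion counts
df <- data.frame(
  advertisement = factor(c("a", "a", "b", "b")),
  age           = c(25, 40, 25, 40),
  converted     = c(300, 12, 900, 4),
  trials        = c(1000, 50, 3000, 40)
)
df$not_converted <- df$trials - df$converted

# Hypothetical cap: once the posterior variance is far below c_cap,
# the weight stops growing with the number of trials (it is bounded
# above by 1 / c_cap)
c_cap <- 1e-5
df$w <- 1 / (post_var(df$converted, df$not_converted) + c_cap)

# Proportion response with the capped weights passed explicitly
fit <- glm(converted / trials ~ advertisement + age,
           family = quasibinomial, weights = w, data = df)
```

This is only a sketch of the questioner's proposal, not an endorsement of it; the usual cbind() formulation already accounts for the trial counts in the likelihood.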

Is this approach reasonable? Am I missing something? Can someone give me more information about this method?

Best Answer

Could you get the correct level of inference from a generalized linear mixed model? Treating advertisement as a random effect introduces shrinkage exactly where you want it: sparsely sampled levels are pulled toward the overall mean, while levels with a reasonable number of trials get estimates close to what a fixed effect would give.

Assuming the use of lme4 in R, I'm not sure whether you'd need to replicate the rows (one row per trial) or not; I'm leaning towards yes. Then you would have something like:

library(lme4)
library(splines)
glmm.fit <- glmer(success ~ ns(age, df = 6) + (1 | advertisement),
                  family = binomial(), data = rep.df)

(Notice I splined age with ns() from the splines package, since I can't imagine the effect would actually be linear.)
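For what it's worth, glmer also accepts the aggregated two-column cbind(successes, failures) response, so row replication may not be necessary. A sketch with simulated data standing in for the questioner's (the column names and simulation parameters are made up for illustration):

```r
library(lme4)     # glmer
library(splines)  # ns

# Toy data in the questioner's aggregated format
set.seed(1)
df <- data.frame(
  advertisement = factor(rep(letters[1:5], each = 20)),
  age           = runif(100, 18, 65)
)
df$trials        <- sample(30:80, 100, replace = TRUE)
df$converted     <- rbinom(100, df$trials, 0.3)
df$not_converted <- df$trials - df$converted

# Two-column response: each row carries its own number of trials,
# so the likelihood weights the rows appropriately without replication
glmm.fit <- glmer(cbind(converted, not_converted) ~ ns(age, df = 6) +
                    (1 | advertisement),
                  family = binomial(), data = df)
```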
