Generalized Linear Model – Count Explanatory Variable with Proportion Dependent Variable

count-data, generalized linear model, regression

I am having a little trouble coming up with a way of analyzing my data. If there is a short answer (e.g., "use logistic regression, dummy") you can just post that and I'll do some digging on my own – I just need to be pointed in the right direction…

My independent variable is a count and my dependent variable is a ratio. Here is the data:

success <- c(322, 358, 323, 277)
total.trials <- c(540, 533, 507, 540)
count <- c(23, 13, 21, 39)
ratio <- success / total.trials

IIRC, it's wrong to do a simple linear regression of ratio ~ count… so what method should I use here? Thanks for the help.


Okay, so here's some of the code I ran after following gung's advice to use a GEE:

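library(gee)  # provides gee()

# One entry per participant
subject <- c(1, 2, 3, 4)
success <- c(322, 358, 323, 277)
total <- c(540, 533, 507, 540)
count <- c(23, 13, 21, 39)
data <- cbind(success, total)  # two-column response: successes and total trials

gee.model <- gee(data ~ count, id = subject, family = 'binomial')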

summary(gee.model)

GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
Link:                      Logit 
Variance to Mean Relation: Binomial 
Correlation Structure:     Independent 

Call:
gee(formula = data ~ count, id = subject, family = "binomial")

Summary of Residuals:
     Min       1Q   Median       3Q      Max 
  276.6608 310.3817 322.1195 331.3620 357.5969 


Coefficients:
               Estimate  Naive S.E.   Naive z  Robust S.E.  Robust z
(Intercept) -0.25516680 0.031437649 -8.116599 0.0134033383 -19.03756
count       -0.01055972 0.001244121 -8.487698 0.0002616798 -40.35360

Estimated Scale Parameter:  0.1066564
Number of Iterations:  1

Working Correlation
     [,1]
[1,]    1

Does this look correct? And, if I am interpreting it correctly, there is a significant effect of count on the proportion.

Best Answer

Each trial is a success or a failure, so you have a binary response. That is the important part; the fact that your explanatory variable is a count doesn't matter. As a result, you should be doing some form of logistic regression. What makes this more difficult is that your data are clustered within just four participants. That means you need to use either a generalized linear mixed model (GLMM) or generalized estimating equations (GEE). This is a subtle decision, but I discuss it at some length here: Difference between generalized linear models & generalized linear mixed models in SPSS. Depending on the options your software affords you, you may also have to un-group your data, so that you have a (very long) matrix in which the response listed in each row is a 1 or a 0.
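
As a rough illustration of the un-grouping step, here is a minimal sketch in base R using the numbers from the question. The object names (`grouped`, `long`, `fit.grouped`, `fit.long`) are just illustrative. The fits use plain `glm()`, which ignores the clustering within participants, so they only demonstrate that the grouped and un-grouped layouts carry the same information; they are not a substitute for the GLMM or GEE analysis.

```r
# Data from the question: one row per participant.
grouped <- data.frame(
  subject = 1:4,
  success = c(322, 358, 323, 277),
  total   = c(540, 533, 507, 540),
  count   = c(23, 13, 21, 39)
)
grouped$failure <- grouped$total - grouped$success

# Un-group: one row per trial, y = 1 for a success and y = 0 for a failure.
long <- data.frame(
  subject = rep(grouped$subject, times = grouped$total),
  count   = rep(grouped$count,   times = grouped$total),
  y       = unlist(Map(function(s, f) rep(c(1, 0), times = c(s, f)),
                       grouped$success, grouped$failure))
)

# Plain logistic regression on both layouts (clustering is ignored here).
# In glm(), a two-column binomial response is cbind(successes, failures),
# not cbind(successes, totals).
fit.grouped <- glm(cbind(success, failure) ~ count, family = binomial, data = grouped)
fit.long    <- glm(y ~ count, family = binomial, data = long)

# Both layouts should give the same coefficient estimates.
print(coef(fit.grouped))
print(coef(fit.long))
```

The `long` data frame is the "very long" layout described above: 2120 rows, one per trial, with `subject` kept so that it can serve as the cluster identifier in whichever clustered model you settle on.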
