Solved – For a logistic regression of a 2 by 2 table using `glm` in `R`, is `cbind` or a full data matrix for the response the correct method?

contingency tables, hypothesis testing, r, regression

For a 2 by 2 table that looks like:

                          Response       No Response
Treatment Given                 25                60
Treatment Not Given             55                43

We may fit a logistic regression by supplying the counts as a binomial response through cbind:

glm(formula = cbind(c(25, 60),c(55, 43)) ~ as.factor(c(1, 0)), family = binomial())
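
For readability, the same grouped fit can also be written against a small data frame; this is just a sketch, and the object and column names are purely illustrative:

    ## A sketch of the grouped-binomial fit with the (successes, failures)
    ## ordering made explicit; object and column names are illustrative.
    grouped <- data.frame(
      treatment = factor(c(1, 0)),
      success   = c(25, 60),  # first column of the cbind() response
      failure   = c(55, 43)   # second column of the cbind() response
    )

    fit_grouped <- glm(cbind(success, failure) ~ treatment,
                       family = binomial(), data = grouped)
    summary(fit_grouped)

The treatment1 coefficient here should match the as.factor(c(1, 0))1 estimate shown below, since the factor coding is the same.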

It seems from reading through related posts on this topic that another option exists: expand the table to one row per observation, using:

DAT <- cbind(c(rep(1, 80), rep(0, 103)), c(rep(1, 25), rep(0,55), rep(1, 60), rep(0, 43))) 

with model:

glm(DAT[,2]~DAT[,1], family = binomial())
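
Equivalently, the one-row-per-observation version can be set up as a data frame and cross-tabulated to confirm it reproduces the counts passed to cbind() above (again a sketch with illustrative names):

    ## A sketch of the same data in binary (one row per unit) form;
    ## the names are illustrative.
    binary <- data.frame(
      treatment = factor(c(rep(1, 80), rep(0, 103))),
      response  = c(rep(1, 25), rep(0, 55), rep(1, 60), rep(0, 43))
    )

    ## The cross-tabulation should reproduce the counts used in the cbind() call.
    xtabs(~ treatment + response, data = binary)

    fit_binary <- glm(response ~ treatment, family = binomial(), data = binary)
    summary(fit_binary)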

Now, the FIRST MODEL outputs:

Call:  glm(formula = cbind(c(25, 60), c(55, 43)) ~ as.factor(c(1, 0)), 
family = binomial())

Coefficients:
    (Intercept)  as.factor(c(1, 0))1  
         0.3331              -1.1216  

Degrees of Freedom: 1 Total (i.e. Null);  0 Residual
Null Deviance:      13.42 
Residual Deviance: -7.55e-15    AIC: 13.75

while the SECOND MODEL outputs:

Call:  glm(formula = DAT[, 2] ~ DAT[, 1], family = binomial())

Coefficients:
    (Intercept)     DAT[, 1]  
         0.3331      -1.1216  

Degrees of Freedom: 182 Total (i.e. Null);  181 Residual
Null Deviance:      252.8 
Residual Deviance: 239.3    AIC: 243.3

The coefficient estimates and p-values are the SAME, but the two fits differ in their degrees of freedom and in the null/residual deviances. The question is: are these two models the same or actually different? Which one is correct? Thanks!

Best Answer

We went through this in a comment chain here, the highlights of which are:

  • The models are equivalent in terms of the estimates and the Wald statistics, because these two ways of representing the data encode exactly the same information (a concrete check appears after this list).

  • In the first model, you've set it up as only 2 binomial measurements (grouped by the two values of a single binary predictor). So the observed frequencies can exactly match the fitted frequencies (i.e. your model is saturated), which is why the residual deviance is zero. Any model that can assign a separate fitted value to each level of a categorical predictor will (trivially) fit the grouped data perfectly.

  • That phenomenon is not possible in the binary case unless you have complete separation, because the fitted probabilities will never be exactly 0 or 1, which is why the residual deviance is not zero there.

  • In terms of which model to use... You should analyze the data as it arose in the first place: if each outcome is a single binary measurement from a different person, analyze it as binary data. But if, for example, each person did ten tasks and you measured the number of successes out of ten, then model cbind(y, 10 - y) as the outcome. In your case, it sounds like single binary outcomes for distinct units (only you know for sure), so you should probably analyze it that way.
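
As a concrete check of the equivalence (assuming the two fits are stored as in the sketches above, fit_grouped and fit_binary), the drop in deviance for the treatment term, i.e. the likelihood-ratio test, is identical in both parameterizations even though the residual deviances themselves differ:

    ## Likelihood-ratio tests for adding treatment: the deviance drop (and hence
    ## the chi-squared statistic and p-value) is the same in both models.
    anova(fit_grouped, test = "Chisq")
    anova(fit_binary,  test = "Chisq")

    ## The same comparison done by hand: null deviance minus residual deviance.
    fit_grouped$null.deviance - fit_grouped$deviance
    fit_binary$null.deviance  - fit_binary$deviance

This matches the outputs above: 13.42 - 0 for the grouped model and 252.8 - 239.3 for the binary model, which agree up to rounding.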
