Solved – Binomial GLM in R: the same data, but two different models


Consider a logistic regression on these data:

X1 X2 Y
1  0  0
1  0  1
0  1  0
0  1  0
0  1  0
0  1  1
1  1  1

R accepts three different representations of these data: one row per observation, and two condensed representations (one with proportions and weights, one with successes and failures). In my mind, these three specifications should be mathematically equivalent: the data are the same 7 observations, just presented to R in different formats.

# One row per observation (7 Bernoulli trials)
data1 <- data.frame(x1=c(1,1,0,0,0,0,1), x2=c(0,0,1,1,1,1,1), y=c(0,1,0,0,0,1,1))
# One row per covariate pattern: y = proportion of successes, w = number of trials
data2 <- data.frame(x1=c(0,1,0,1), x2=c(0,0,1,1), y=c(0,0.5,0.25,1), w=c(0,2,4,1))
# One row per covariate pattern: two-column response of (successes, failures)
data3x <- data.frame(x1=c(0,1,0,1), x2=c(0,0,1,1))
data3y <- cbind(c(0,1,1,1), c(0,1,3,0))
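
As a sanity check, the weighted representation can be rebuilt from the long form by collapsing on covariate pattern (a minimal sketch using base R's aggregate; the variable names p and n are mine):

p <- aggregate(y ~ x1 + x2, data = data1, FUN = mean)    # proportion of successes per pattern
n <- aggregate(y ~ x1 + x2, data = data1, FUN = length)  # observations per pattern
data.frame(x1 = p$x1, x2 = p$x2, y = p$y, w = n$y)
# should reproduce the populated rows of data2; the (x1=0, x2=0) pattern
# never occurs, which is why data2 gives it weight 0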

model1 <- glm(y~x1+x2, data=data1, family="binomial")
model2 <- glm(y~x1+x2, data=data2, family="binomial", weights=w)
model3 <- glm(data3y~data3x$x1+data3x$x2, family="binomial")

Models 2 and 3 are the same, which makes sense (a quick check below confirms it). But Model 1 differs from Models 2 and 3, and I can't reason out why the same data should return different model statistics (coefficients, null and residual deviance). Models 2 and 3 are just different representations of the same data.
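
Confirming that the two condensed fits coincide (check added for completeness):

all.equal(coef(model2), coef(model3), check.attributes = FALSE)
# should be TRUE; check.attributes = FALSE because model3's coefficients are
# named data3x$x1 and data3x$x2 rather than x1 and x2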

This might be a red herring, but Model 1's coefficients are shifted by 4 units relative to Model 2's, which is exactly the difference between the two fits in the number of populated rows (and in residual degrees of freedom).

> model1

Call:  glm(formula = y ~ x1 + x2, family = "binomial", data = data1)

Coefficients:
(Intercept)           x1           x2  
     -19.66        19.66        18.57  

Degrees of Freedom: 6 Total (i.e. Null);  4 Residual
Null Deviance:      9.561 
Residual Deviance: 7.271    AIC: 13.27
> model2

Call:  glm(formula = y ~ x1 + x2, family = "binomial", data = data2, 
    weights = w)

Coefficients:
(Intercept)           x1           x2  
     -23.66        23.66        22.57  

Degrees of Freedom: 2 Total (i.e. Null);  0 Residual
Null Deviance:      2.289 
Residual Deviance: 3.167e-10    AIC: 9.112

Best Answer

The model is

$$\operatorname{E}Y = \frac{1}{1+ \exp[-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)]}$$

and it's saturated, having as many parameters to estimate (three) as there are populated distinct covariate patterns. So the equations to solve are as follows:

For $x_1=1$, $x_2=0$, $\operatorname{E}Y=\frac{1}{2}$:

$\beta_0 + \beta_1 = \operatorname{logit}\tfrac{1}{2} = 0$

For $x_1=0$, $x_2=1$, $\operatorname{E}Y=\frac{1}{4}$:

$\beta_0 + \beta_2 = \operatorname{logit}\tfrac{1}{4} = -\log 3$

For $x_1=1$, $x_2=1$, $\operatorname{E}Y=1$:

$\beta_0 + \beta_1 + \beta_2 = \operatorname{logit} 1 = \infty$

There is quasi-complete separation (if $x_1+x_2>1$ then $\operatorname{E}Y=1$), so maximum-likelihood estimates of the coefficients are unbounded. But any sufficiently large value $c$ can stand in for infinity, giving the solutions:

$\beta_0 = -(c + \log 3)$

$\beta_1 = c + \log3$

$\beta_2 = c$
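
Both fits satisfy these relations; the structure can be checked directly from the model objects in the question:

coef(model1)["(Intercept)"] + coef(model1)["x1"]  # beta0 + beta1: approximately 0
coef(model1)["x1"] - coef(model1)["x2"]           # (c + log 3) - c: approximately log(3) = 1.0986
coef(model2)["x1"] - coef(model2)["x2"]           # same difference, just with a larger c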

I don't know why glm gives up maximizing the likelihood at different values of $c$ depending on the data structure, but it's of no practical consequence: predictions and differences in likelihoods will be almost the same. (The reported deviances also differ for a separate reason: with ungrouped data the saturated reference model fits each of the 7 observations individually, while with grouped data it only fits the covariate patterns, so the two residual deviances are measured against different baselines.)
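
The practical equivalence is easy to confirm: predicted probabilities per covariate pattern should be essentially identical across the two fits, even though the coefficients differ:

cbind(m1 = predict(model1, newdata = data2, type = "response"),
      m2 = predict(model2, newdata = data2, type = "response"))
# both columns should be approximately (0, 0.5, 0.25, 1)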