Chi-Squared Test – Discrepancy Between Chi-Squared Results and Logistic Regression

Tags: chi-squared-test, r

Ok, backstory:

Per my understanding, based on stuff like this, chi-squared tests and logistic regressions can often be used for fairly similar purposes.
That's not to say they should give identical results (I could see the p-values differing somewhat), but if it's a 2×2 table, both can be used, and it really just depends on the question you want to answer.
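For example (made-up counts, just to illustrate what I mean), both can be run on the same 2×2 table:

# Hypothetical 2x2 table: group A has 40/50 "yes", group B has 25/50 "yes"
tab <- matrix(c(40, 25, 10, 25), nrow = 2,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))
chisq.test(tab)

# The same table expanded to one row per person, fed to a logistic regression
group   <- rep(c("A", "B"), each = 50)
outcome <- c(rep(1, 40), rep(0, 10), rep(1, 25), rep(0, 25))
summary(glm(outcome ~ group, family = "binomial"))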

Well… based on that understanding, allow me to introduce my data and my question:

We collected information on a group of people, including whether they spoke Spanish as their primary language and whether they were given instructions in that primary language. Here's that data (and the chi-squared test my R package automatically applies; I blacked out some of the unnecessary data):
[Screenshot: 2×2 table with the package's chi-squared test output; roughly 71% vs. 100% received instructions in their primary language]

Given it was 71% vs. 100%, the significant chi-squared makes sense to me, and I would expect a logistic regression to see SOMETHING there. Well, given this R code:

reg <- glm(instructions___1 ~ engl_primlang, data = data, family = "binomial")
summary(reg)

I get these results:
[Screenshot: summary(reg) output; the engl_primlang coefficient is far from significant]

Not even close to significant. What am I misunderstanding?

Edit based on comment
Added a screenshot of the structure of my data; I couldn't post the whole thing here:

[Screenshot: structure of the data frame]

Best Answer

This is not really an answer but rather an investigation that ultimately adds to the question. Nothing you did seems wrong to me, and neither does your expectation that the glm slope should be significant.

I tried to reproduce your result like this; I didn't get the same output, but something similar:

x1 <- c(rep(1,89), rep(2,89))     # group indicator: 89 observations per group
x2 <- c(rep(0,71), rep(1,18+89))  # group 1: 18/89 successes; group 2: 89/89 successes
xd <- cbind(x1, x2)

glmxd <- glm(x2 ~ x1, family = "binomial")

> summary(glmxd)

Call:
glm(formula = x2 ~ x1, family = "binomial")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.67224  -0.67224   0.00005   0.00005   1.78788  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -23.31    1879.42  -0.012    0.990
x1             21.94    1879.42   0.012    0.991

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.429  on 177  degrees of freedom
Residual deviance:  89.623  on 176  degrees of freedom
AIC: 93.623

Number of Fisher Scoring iterations: 19
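For contrast, the chi-squared test on these same counts (using x1 and x2 from above) comes out highly significant, which is exactly the discrepancy you describe:

# 2x2 table of 18/89 vs. 89/89 successes; the chi-squared statistic is huge here
chisq.test(table(x1, x2))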

Another way to do supposedly the same thing is this, which gives yet another result, still as useless as the one before. (Note that here the first group has 71/89 successes rather than 18/89 as above; either way, the second group has 100% successes, which is what matters.)

x3 <- c(0, 1)                  # group indicator
x4 <- c(71, 89)                # successes per group: 71/89 and 89/89
n  <- c(89, 89)                # trials per group
nd <- as.data.frame(cbind(x3, x4, n))

glmn <- glm(x4/n ~ x3, family = binomial, data = nd, weights = n)

> summary(glmn)

Call:
glm(formula = x4/n ~ x3, family = binomial, data = nd, weights = n)

Deviance Residuals: 
[1]  0  0

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.372e+00  2.639e-01     5.2 1.99e-07 ***
x3          2.582e+01  5.173e+04     0.0        1    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.6983e+01  on 1  degrees of freedom
Residual deviance: 2.7497e-10  on 0  degrees of freedom
AIC: 8.512

Number of Fisher Scoring iterations: 22

I get just another slightly different but similar result if I code x1 and x2 above as factors. Funnily, this then behaves more like the second solution with x3 and x4 above, with the intercept highly significant!? But it then reports 177 and 176 df rather than 1 and 0.
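Concretely, by coding as factors I mean a one-line variation on the first fit (same data, both variables wrapped in factor()):

# Same individual-level data as the first fit, response and regressor as factors
glmfac <- glm(factor(x2) ~ factor(x1), family = "binomial")
summary(glmfac)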

Now I tried both of these with 88 out of 89 successes in the second group:

x3 <- c(0, 1)                  # group indicator
x4 <- c(71, 88)                # now 88/89 successes in the second group
n  <- c(89, 89)                # trials per group
nd <- as.data.frame(cbind(x3, x4, n))

glmn <- glm(x4/n ~ x3, family = binomial, data = nd, weights = n)

> summary(glmn)

Call:
glm(formula = x4/n ~ x3, family = binomial, data = nd, weights = n)

Deviance Residuals: 
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.3723     0.2639   5.200 1.99e-07 ***
x3            3.1050     1.0397   2.986  0.00282 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.0325e+01  on 1  degrees of freedom
Residual deviance: 1.3323e-14  on 0  degrees of freedom
AIC: 10.501

Number of Fisher Scoring iterations: 4

This looks fine. Coding it with 88 successes but in your way (one row per observation) also gives significant p-values, though not exactly the same ones, and different deviances with different degrees of freedom.
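By "your way" I mean the one-row-per-observation coding; my reconstruction of it for the 88/89 case looks like this (the names g and y are mine):

# 71/89 successes in group 0, 88/89 in group 1, one row per observation
g <- c(rep(0, 89), rep(1, 89))
y <- c(rep(1, 71), rep(0, 18),   # group 0: 71 successes, 18 failures
       rep(1, 88), rep(0, 1))   # group 1: 88 successes, 1 failure
summary(glm(y ~ g, family = "binomial"))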

In my opinion something is fishy here.

I suspect that the internal iteration used to estimate the GLM gets confused by 89/89, i.e., 100% successes in one group. In fact, I have looked up an algorithm for finding the solution (not sure whether it's the same one glm uses) that has a $\pi_i(1-\pi_i)$ term in the denominator, meaning that things can be messed up if $\pi_i = 1$. Still, I am surprised to see different degrees of freedom and slightly different results depending on the coding of the data, even in the case with 88/89 successes. (I think 1 and 0 are the correct df; it seems that glm counts the observations differently depending on whether they are given individually or summarised with weights n, but I'd expect these to be the same, at least if the x-variables are coded as factors.)
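To make the $\pi_i(1-\pi_i)$ point concrete, here is one standard formulation of the iteration (Fisher scoring / iteratively reweighted least squares; whether glm uses exactly this internally I leave open, as said above):

$$\beta^{(t+1)} = \beta^{(t)} + \left(X^\top W X\right)^{-1} X^\top (y - \pi), \qquad W = \operatorname{diag}\big(\pi_i(1-\pi_i)\big),$$

with standard errors taken from $(X^\top \hat{W} X)^{-1}$. As $\hat\pi_i \to 1$, the weights $\pi_i(1-\pi_i) \to 0$, the information matrix degenerates, and the standard errors blow up, which fits the huge standard error (1879.42) in the first fit.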

I suspect this is a problem with glm when there are 100% successes for one level of a binary x-variable, but I'm not sure - somebody else please help...
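One way to see the iteration running off to the boundary (a sketch of mine, reusing x1 and x2 from the first fit): tighten glm's convergence tolerance, and the slope keeps growing instead of settling down.

# The default tolerance stops after 19 iterations; a stricter one lets the
# slope (and its standard error) keep growing, since with 89/89 successes
# in one group the likelihood has no finite maximum.
fit_default <- glm(x2 ~ x1, family = "binomial")
fit_strict  <- glm(x2 ~ x1, family = "binomial",
                   control = glm.control(epsilon = 1e-14, maxit = 100))
coef(fit_default)
coef(fit_strict)   # noticeably larger coefficients than fit_default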
