Solved – Logistic Regression using two categorical variables

logistic-regression

I've been doing some work on regression, and one paper in particular caught my attention: the authors used two categorical variables in a logistic regression. From my understanding, if you're only working with two categorical variables, doesn't the problem just come down to conditional probabilities? Out of interest, I wrote some code and was surprised to see that the method returned a p-value (reproducible code below). I'm struggling to understand what the null and alternative hypotheses are in this context, so that I can interpret what the p-value is telling me. Any assistance in dissecting the resulting model summary would be greatly appreciated.

Example interpretation:

P(Dep = Class1 | Pred = C) = 20/40 = 1/2

Is the model summary telling me this?

Code

foo <- data.frame(Pred   = c(rep("A",80),rep("B",20),
                             rep("C",40),rep("D",60)),
                  Dep    = c(rep("Class1",120),
                             rep("Class2",80)))
fit      <- glm(Dep ~ Pred, family=binomial(link='logit'), data = foo)
summary(fit)

Output

Call:
glm(formula = Dep ~ Pred, family = binomial(link = "logit"), 
    data = foo)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.17741  -0.00003  -0.00003   0.00003   1.17741  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.157e+01  3.268e+03  -0.007    0.995
PredB       -7.168e-11  7.308e+03   0.000    1.000
PredC        2.157e+01  3.268e+03   0.007    0.995
PredD        4.313e+01  4.992e+03   0.009    0.993

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 269.205  on 199  degrees of freedom
Residual deviance:  55.452  on 196  degrees of freedom
AIC: 63.452

Number of Fisher Scoring iterations: 20
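
For reference, this is how I'm getting the conditional probabilities directly from the table, together with the model's fitted values (as I understand it, glm treats the second level of Dep, i.e. "Class2", as the event, so the fitted values should be P(Class2 | Pred)):

prop.table(table(foo$Pred, foo$Dep), margin = 1)   ## P(Dep | Pred) from the raw counts
predict(fit, newdata = data.frame(Pred = c("A", "B", "C", "D")),
        type = "response")                         ## fitted P(Class2 | Pred), one value per level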

Best Answer

Your intuition is right.

The logistic regression model gives (asymptotically) the same inference as a Pearson chi-squared test of independence for categorical data. In both cases, the null hypothesis is that the conditional probabilities are equal to the marginal probabilities. You can show with some algebra that this necessarily implies that the odds ratio is 1.
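
To sketch that algebra (my notation): comparing any level, say C, with the reference level A, write $p_A = P(\text{Class2} \mid A)$ and $p_C = P(\text{Class2} \mid C)$. Under the null both equal the marginal $P(\text{Class2})$, so

$$\mathrm{OR} = \frac{p_C/(1-p_C)}{p_A/(1-p_A)} = 1 \quad\Longleftrightarrow\quad \beta_C = \log \mathrm{OR} = 0,$$

which is exactly the null hypothesis that each Wald $z$ statistic in the summary.glm output is testing for its own contrast.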

The minute differences between the actual logistic regression and Pearson test statistics are due to how they're computed. The $p$-values you get in R from calling summary.glm come from a Wald test, whereas the Pearson test is a (closely related) score test.

The problem with your example is that you have a singular logistic model: all but one of your predictor levels determine the outcome perfectly (A and B are all Class1, D is all Class2), so the maximum-likelihood estimates head off towards $\pm\infty$ and the Wald standard errors blow up, which is why the summary shows huge coefficients with huge standard errors. One advantage of the score test is that it can still provide sensible test statistics for models like this one which "explode" (see the aside after the output below). For a more sane example, consider the following:

set.seed(123)
foo2 <- as.data.frame(sapply(foo, sample)) ## permute to avoid singularity
fit <- glm(Dep ~ Pred, data=foo2, family=binomial)
library(lmtest)
waldtest(fit, test='Chisq')
chisq.test(table(foo2))

Gives us:

> waldtest(fit, test='Chisq')
Wald test

Model 1: Dep ~ Pred
Model 2: Dep ~ 1
  Res.Df Df  Chisq Pr(>Chisq)
1    196                     
2    199 -3 1.7774     0.6199
> chisq.test(table(foo2))

    Pearson's Chi-squared test

data:  table(foo2)
X-squared = 1.7882, df = 3, p-value = 0.6175
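
As an aside on the "exploding" model above: if I'm reading ?anova.glm correctly, the score test is also available directly from a fitted glm via test = "Rao", and because it is evaluated at the null fit it should stay finite even for the original singular fit on foo, essentially reproducing Pearson's $X^2$ from the raw 4 x 2 table:

fit0 <- glm(Dep ~ Pred, family = binomial, data = foo)  ## the original, separated fit
anova(fit0, test = "Rao")   ## Rao score test for adding Pred to the intercept-only model
chisq.test(table(foo))      ## Pearson X^2 on the same table, for comparison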