Solved – Post-hoc test after a GLM with binary data

binomial distributiongeneralized linear modelpost-hoc

I have run various logistic regressions (GLM) from the binomial family and they have produced some very interesting results. I would now like to run post-hoc tests to find out which levels of the explanatory variables are significant, but I am finding it very difficult to find a post-hoc test that is compatible with my data.

I am looking at a population of skeletons. I have recorded the presence/absence of infectious lesions at the level of the individual. I have a response variable (Periostitis, which is a type of lesion and which has two levels: present and absent) and I have four explanatory variables (Age.Group, Sex, Cemetery, and Time.Period). These represent the age of the individual (A, B, C, and D), the sex of the individual (Male or Female), the cemetery the individual was recovered from (Cem1, Cem2, and Cem3), and the time period the individual is dated to (TP1 and TP2), respectivly.

I have run logistic regressions because my data is binomial. An example of which can be seen here:

Call:
glm(formula = Periosna$Periostitis ~ Periosna$Cemetery, family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9103   0.5931   0.6841   0.6841   0.9537  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)             1.3332     0.2087   6.387 1.69e-10 ***
Periosna$CemeteryCem2   0.3155     0.5311   0.594   0.5525    
    Periosna$CemeteryCem3  -0.7811     0.3557  -2.196   0.0281 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 244.03  on 221  degrees of freedom
Residual deviance: 238.04  on 219  degrees of freedom
AIC: 244.04

Number of Fisher Scoring iterations: 4

I now want to run a post-hoc test to see which levels (Cem1, Cem2, or Cem3) are significant. I have read everything on this site and on others and have tried to run the glht function from the multcomp package. I see that there are different types of tests (matrix, character, expression, mcp, and mlf), but I do not know which to use. There is a lot of literature online about mcp, but I do not think that I can use this as my data is binomial. When I try to use it, I get an error code:

> library(multcomp)
> 
> results<-glht(modelA,linfct=mcp(Cemetery="Tukey"))
Error in mcp2matrix(model, linfct = linfct) : 
  Variable(s) ‘Cemetery’ have been specified in ‘linfct’ but cannot be     found in ‘model’! 

What would you suggest I use to analyze the levels of my explanatory variables?

Thank you very much for your help. I am stumped!

Best Answer

The problem you haveg is that are abusing (misusing) the formula notation. You never use a formula like this:

Periosna$Periostitis ~ Periosna$Cemetery

because, as you've found out it, it totally breaks the logic that you want to then use for subsequent operations.

You asked mcp() to set up Tukey contrasts for the covariate Cemetery, yet you fitted a model with covariate name Periosna$Cemetery, hence it quite rightly told you that a variable of the stated name was not involved in the model fit.

Instead, the call should have been

glm(Periostitis ~ Cemetery, data = Periosna, family = binomial)

noting that we use the variable names directly, as they exist in Periosna, and we tell R where to locate the variables in the formula via the data argument.

Now when you call glht() it should be able to find the Cemetery variable.