Solved – Binomial logistic regression with categorical predictors and interaction (binomial family argument and p-value differences)

Tags: generalized linear model, logistic, p-value, r

I have a question about significance, and about how it changes when I include the family = binomial argument in my glm model (which has an interaction term) versus when I leave that argument out. I am very new to logistic regression and have only done simpler linear regression in the past.

I have a dataset of observations of tree growth rings, with two categorical explanatory variables (Treatment and Origin). The Treatment variable is an experimental drought treatment with four levels (Control, First Drought, Second Drought, and Two Droughts). The Origin variable has three levels and refers to the tree's origin (coded by color as Red, Yellow, and Blue). My observations are whether a growth ring is present or not (1 = growth ring present, 0 = no growth ring).

In my case, I am interested in the effect of Treatment, the effect of Origin, and also the possible interaction of Treatment and Origin on growth ring presence.

It has been suggested that binomial logistic regression would be a good method for analyzing this data set. (Hopefully that is appropriate? Maybe there are better methods?)

I have n = 5 per cell (5 observations for each Treatment × Origin combination; for example, 5 observations of growth rings for the Control/Blue trees, 5 for the Control/Yellow trees, etc.), so in total there are 60 growth-ring observations in the dataset.
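
As a quick sanity check of that layout in R (a sketch, assuming the columns are named Treatment and Origin as in the model below):

with(growthringdata, table(Treatment, Origin))   # should show 5 in every cell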

In R, I've used the glm() function, set up as follows:

growthring_model <- glm(growthringobs ~ Treatment + Origin + Treatment:Origin, data = growthringdata, family = binomial(link = "logit"))

I've factored my explanatory variables so that the Control treatment and the Blue origin trees are my reference.
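
Roughly, that releveling looks like this (a sketch with my column names; the key point is just that Control and Blue are the reference levels):

growthringdata$Treatment <- relevel(factor(growthringdata$Treatment), ref = "Control")
growthringdata$Origin    <- relevel(factor(growthringdata$Origin), ref = "Blue")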

What I notice is that when I leave the family = binomial argument out of the code, I get p-values that I would reasonably expect given the data. However, when I add the family = binomial argument, the p-values are all 1 or very close to 1 (e.g. 1, 0.998, 0.999). This seems odd. I could see there being low significance, but the fact that the values are ALL so near 1 makes me suspicious given my actual data. Without the family = binomial argument, the p-values seem to make more sense (even though they are still relatively high/non-significant).

Can someone help me to understand how the binomial argument is shifting my results so much? (I understand that it refers to the distribution, i.e. my observations are either 1 or 0.) What exactly is it changing in the model? Is this a result of low sample size? Is there something wrong in my code? Or could those very high p-values actually be correct?

Here is a read out of my model summary with the binomial argument present:

Call: 
glm(formula = Growthring ~ Treatment + Origin + Treatment:Origin, family = binomial(link = "logit"), data = growthringdata)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.79412  -0.00005  -0.00005  -0.00005   1.79412  

Coefficients:
                                       Estimate Std. Error z value  Pr(>|z|)
(Intercept)                          -2.057e+01  7.929e+03  -0.003    0.998
TreatmentFirst Drought               -9.931e-11  1.121e+04   0.000    1.000
TreatmentSecond Drought               1.918e+01  7.929e+03   0.002    0.998
TreatmentTwo Droughts                -1.085e-10  1.121e+04   0.000    1.000
OriginYellow                          1.918e+01  7.929e+03   0.002    0.998
OriginRed                            -1.045e-10  1.121e+04   0.000    1.000
TreatmentFirst Drought:OriginYellow  -1.918e+01  1.373e+04  -0.001    0.999
TreatmentSecond Drought:OriginYellow -1.739e+01  7.929e+03  -0.002    0.998
TreatmentTwo Droughts:OriginYellow   -1.918e+01  1.373e+04  -0.001    0.999
TreatmentFirst Drought:OriginRed      1.038e-10  1.586e+04   0.000    1.000
TreatmentSecond Drought:OriginRed     2.773e+00  1.121e+04   0.000    1.000
TreatmentTwo Droughts:OriginRed       2.016e+01  1.373e+04   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 57.169  on 59  degrees of freedom
Residual deviance: 28.472  on 48  degrees of freedom
AIC: 52.472

Number of Fisher Scoring iterations: 19

And here is a read out of my model summary without the binomial argument:

Call:
glm(formula = Growthring ~ Treatment + Origin + Treatment:Origin, 
    data = growthringdata)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
  -0.8     0.0     0.0     0.0     0.8  

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)  
(Intercept)                          -4.278e-17  1.414e-01   0.000   1.0000  
TreatmentFirst Drought                3.145e-16  2.000e-01   0.000   1.0000  
TreatmentSecond Drought               2.000e-01  2.000e-01   1.000   0.3223  
TreatmentTwo Droughts                 1.152e-16  2.000e-01   0.000   1.0000  
OriginYellow                          2.000e-01  2.000e-01   1.000   0.3223  
OriginRed                             6.879e-17  2.000e-01   0.000   1.0000  
TreatmentFirst Drought:OriginYellow  -2.000e-01  2.828e-01  -0.707   0.4829  
TreatmentSecond Drought:OriginYellow  2.000e-01  2.828e-01   0.707   0.4829  
TreatmentTwo Droughts:OriginYellow   -2.000e-01  2.828e-01  -0.707   0.4829  
TreatmentFirst Drought:OriginRed     -3.243e-16  2.828e-01   0.000   1.0000  
TreatmentSecond Drought:OriginRed     6.000e-01  2.828e-01   2.121   0.0391 *
TreatmentTwo Droughts:OriginRed       4.000e-01  2.828e-01   1.414   0.1638  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.1)

    Null deviance: 8.9833  on 59  degrees of freedom
Residual deviance: 4.8000  on 48  degrees of freedom
AIC: 44.729

Number of Fisher Scoring iterations: 2

EDIT: Here is the read out for the same model without using an interaction term:

Call:
glm(formula = Growthring ~ Treatment + Origin, family = binomial(link = "logit"), 
data = growthringdata)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.80903  -0.51691  -0.12570  -0.00003   2.38811  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)   
(Intercept)               -4.8369     1.6079  -3.008  0.00263 **
TreatmentFirst Drought   -16.8259  2579.2667  -0.007  0.99480   
TreatmentSecond Drought    3.2826     1.2798   2.565  0.01032 * 
TreatmentTwo Droughts      0.8185     1.3239   0.618  0.53640   
OriginYellow               2.0448     1.3214   1.548  0.12174   
OriginRed                  2.9741     1.3608   2.185  0.02885 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 57.169  on 59  degrees of freedom
Residual deviance: 33.143  on 54  degrees of freedom
AIC: 45.143

Number of Fisher Scoring iterations: 18

Best Answer

To your question

Can someone help me to understand how the binomial argument is shifting my results so much? (I understand that it refers to the distribution, i.e. my observations are either 1 or 0.) What exactly is it changing in the model? Is this a result of low sample size? Is there something wrong in my code? Or could those very high p-values actually be correct?

The default family in glm is the Gaussian distribution, i.e. you get the same fit as if you had called lm. Thus you are maximizing two different likelihoods. The question is quite close to the one here. Your outcomes are binary, so the logistic (binomial) likelihood is a more obvious choice than the Gaussian, for which outcomes should be able to take values on the whole real line.
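
A quick way to see this (a sketch using the variable names from your summary output): dropping the family argument reproduces the lm fit exactly, while family = binomial fits a genuinely different model.

fit_lm  <- lm(Growthring ~ Treatment * Origin, data = growthringdata)
fit_glm <- glm(Growthring ~ Treatment * Origin, data = growthringdata)   # family = gaussian() by default
all.equal(coef(fit_lm), coef(fit_glm))                                   # TRUE: identical coefficients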

Your code does seem correct, though you have not posted the data. It looks like either there is no effect of your explanatory variables or your sample size is too small (60 observations for 12 parameters). I do not have a source for it, but I recall a rough rule of thumb of at least ~5 observations per parameter. The very large standard errors (on the order of 10^4) and the 19 Fisher scoring iterations in your interaction model point in the same direction: with only 5 binary observations per cell, some cells are fitted perfectly, the corresponding estimates wander off toward ±infinity, and the Wald p-values end up near 1. Have you tried a fit without the interaction?
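
Rather than reading the individual Wald z-tests, you could also compare the interaction and additive models directly with a likelihood-ratio (deviance) test, e.g. (again a sketch with your variable names):

fit_full <- glm(Growthring ~ Treatment * Origin, data = growthringdata,
                family = binomial(link = "logit"))
fit_add  <- glm(Growthring ~ Treatment + Origin, data = growthringdata,
                family = binomial(link = "logit"))
anova(fit_add, fit_full, test = "LRT")   # chi-squared test on the change in deviance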