R – How to Ask if Correlation Between Two Binary Variables Varies Between Groups?

categorical data, correlation, interpretation, logistic, r

This seems like a simple coding/statistics problem, but I've been working on it and reading about it for days, and I just cannot seem to wrap my head around it. I am a biologist, not a statistician, and any help would be appreciated.

I am trying to find a way to ask if the degree of correlation or relationship between two binary variables (impacts present/absent ~ threats present/absent) varies between species, and separately, between categories. Suggestions on better approaches/packages/ways to code what I'm after would be appreciated, as well as general input on what I've already done. My dataframe looks like this:

set.seed(123)
df <- data.frame(
  Species      = rep(c("plant1", "plant2", "plant3", "plant4", "plant5"), each = 5),
  Category     = rep(c("A", "B", "C", "D", "E"), 5),
  threat.count = sample(0:1, replace = TRUE, size = 25),
  impact.count = sample(0:1, replace = TRUE, size = 25)
)

I can ask what the general correlation between threats and impacts is with a non-parametric Spearman test for correlation between paired samples:

cor.test(df$threat.count, df$impact.count, method = "spearman", 
    exact = FALSE, conf.int=TRUE)  
# rho = -0.1666667; in my real data the correlation is much 
# higher, around 0.26.

I would interpret this as: Overall, there is a 26% correlation between threats and impacts.
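A side note on reading that number: for two 0/1 variables, Spearman's rho coincides with Pearson's r and with the phi coefficient of the 2x2 table, so a value like 0.26 is probably better read as a weak positive association than as a percentage. The sketch below uses made-up counts (the vectors `threat` and `impact` are illustrative, not the real data) and also computes the odds ratio as an alternative effect size.

```r
# Illustrative data only (not the asker's data): 2x2 cell counts expanded to rows.
threat <- rep(c(0, 0, 1, 1), times = c(8, 4, 3, 9))
impact <- rep(c(0, 1, 0, 1), times = c(8, 4, 3, 9))

tab <- table(threat, impact)

# Phi coefficient computed directly from the 2x2 cell counts.
phi <- as.numeric(
  (tab[1, 1] * tab[2, 2] - tab[1, 2] * tab[2, 1]) /
    sqrt(prod(rowSums(tab)) * prod(colSums(tab)))
)

# For binary variables, both correlation methods return the same value as phi.
pearson  <- cor(threat, impact, method = "pearson")
spearman <- cor(threat, impact, method = "spearman")

# Odds ratio: how many times higher the odds of an impact are when a threat is present.
odds_ratio <- as.numeric((tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1]))

print(c(phi = phi, pearson = pearson, spearman = spearman, odds_ratio = odds_ratio))
```

Because all three correlation measures agree here, the choice between them matters less than the choice between a correlation-type summary (phi) and an odds-ratio-type summary, which is what logistic regression reports.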

However, I would like to dig in and ask if the degree of correlation varies between Species (and later between Categories), and if so, how (e.g. is the correlation between threats and impacts stronger for some species than for others?).

I have tried creating both generalized linear models, and generalized linear mixed models to get at this, and am not sure if either answers my questions and if I am interpreting them correctly.

To ask if degree of correlation varies between species overall, I could do something like this:

mod0 <- glm(impact.count ~ threat.count, data = df, family = 
            binomial(link = "logit"))
mod1 <- glm(impact.count ~ threat.count + Species, data = df, 
            family = binomial(link = "logit"))
anova(mod0, mod1, test = 'LRT') # here, no, but in my real data, 
                                # yes

All I can say from that would be "yes/no, the degree of correlation between threats and impacts varies between species", but I would like to know by how much.
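It may be worth noting that comparing a model with `threat.count` against one with `threat.count + Species` tests whether Species shifts the baseline probability of an impact, not whether the threat effect differs by species. One way to test the latter, sketched here on the assumption that a fixed-effects interaction model is acceptable, is to compare the additive model with a `threat.count * Species` model. The data frame `df2` and the helper `make_group` are hypothetical and exist only to make the example deterministic.

```r
# Hypothetical helper: expand 2x2 cell counts into one row per observation.
# n00 = threat 0 / impact 0, n01 = threat 0 / impact 1, and so on.
make_group <- function(name, n00, n01, n10, n11) {
  data.frame(
    Species      = name,
    threat.count = rep(c(0, 0, 1, 1), times = c(n00, n01, n10, n11)),
    impact.count = rep(c(0, 1, 0, 1), times = c(n00, n01, n10, n11))
  )
}

# Made-up data in which the threat/impact association differs by species.
df2 <- rbind(
  make_group("plant1", 10, 2, 2, 10),  # positive association
  make_group("plant2", 6, 6, 6, 6),    # no association
  make_group("plant3", 3, 9, 9, 3)     # negative association
)
df2$Species <- factor(df2$Species)

# Additive model: one common threat effect, species-specific baselines.
additive_mod <- glm(impact.count ~ threat.count + Species,
                    data = df2, family = binomial)

# Interaction model: the threat effect is allowed to differ by species.
interact_mod <- glm(impact.count ~ threat.count * Species,
                    data = df2, family = binomial)

# Likelihood-ratio test of "does the threat effect vary between species?"
lrt <- anova(additive_mod, interact_mod, test = "LRT")
print(lrt)
```

If the test is significant, the interaction coefficients from `summary(interact_mod)` give the difference in log odds ratio between each species and the reference species.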

So, we can look at the summary from mod1:

summary(mod1)

As I understand it, in this output, the coefficient estimate for threat.count is the log-odds of an impact being present (impact.count = 1) with a 1-unit change in threat.count (aka if threat count = 1). The coefficient estimates for Species2-Species5 are the difference in log-odds between their respective coefficients and the coefficient for Species1 (the Intercept term). I could get the "true" coefficients for all categories by running a no-intercept model like this:

mod2 <- glm(impact.count ~ 0 + threat.count + Species, data = df, 
            family = binomial(link = "logit"))
summary(mod2)

The coefficient estimates for Species1:Species5 here are the log-odds that an impact will be present (impact.count = 1) for each Species if threat count is … constant(?) 1(?) mean= 0.5(?).
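One way to check what the Species coefficients in a no-intercept model represent is to compare them with model predictions. In a model of the form `0 + threat.count + Species`, each Species coefficient is the log-odds of an impact for that species when `threat.count` is 0, and `plogis()` converts it to a probability. The sketch below uses a hypothetical data frame `df2` built by a hypothetical helper `make_group`, not the real data.

```r
# Hypothetical helper: expand 2x2 cell counts into one row per observation.
make_group <- function(name, n00, n01, n10, n11) {
  data.frame(
    Species      = name,
    threat.count = rep(c(0, 0, 1, 1), times = c(n00, n01, n10, n11)),
    impact.count = rep(c(0, 1, 0, 1), times = c(n00, n01, n10, n11))
  )
}

df2 <- rbind(
  make_group("plant1", 10, 2, 2, 10),
  make_group("plant2", 6, 6, 6, 6),
  make_group("plant3", 3, 9, 9, 3)
)
df2$Species <- factor(df2$Species)

# Same structure as mod2 in the question, fitted to the made-up data.
mod2_demo <- glm(impact.count ~ 0 + threat.count + Species,
                 data = df2, family = binomial)

# Species coefficients, converted from log-odds to probabilities.
species_terms <- coef(mod2_demo)[grep("^Species", names(coef(mod2_demo)))]
p_from_coef   <- plogis(species_terms)

# Predicted probability of an impact for each species when threat.count = 0.
newdata <- data.frame(
  threat.count = 0,
  Species      = factor(levels(df2$Species), levels = levels(df2$Species))
)
p_from_predict <- predict(mod2_demo, newdata = newdata, type = "response")

print(cbind(p_from_coef, p_from_predict))
```

The two columns agree, which suggests the back-transformation is only meaningful for these baseline terms. Applying it to the `threat.count` coefficient does not give a probability, because that coefficient is a difference in log-odds.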

I could get the odds ratios and probabilities for either version with:

exp(coef(modx))                     #get odds ratios 
exp(coef(modx))/(1+exp(coef(modx))) #get probabilities

My issue here is, how do I/can I interpret any of these in terms of whether/how much the correlation between threats and impacts varies between species?

I have also tried making a generalized linear mixed model:

library(lme4)
mod3 <- glmer(impact.count ~ threat.count + (1|Species), data = 
              df, family = binomial(link = "logit"))
summary(mod3)

This gives me an estimate for threat.count, but I am running into the same interpretation problem as before.

I also tried using lmList to look at the relationship between impact count and threat count separately for each species, but am worried about whether this is a statistically sound approach… any multiple comparisons issues? Also, how would I get it to spit out whether the sub-models are significant?

corr.spp.list <- lme4::lmList(impact.count ~ threat.count 
                 |Species, data = df, 
                 family = binomial(link = "logit"), 
                 warn = TRUE) #fitting each model separately by 
                              # species
corr.spp.list
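On the multiple-comparisons worry, one possible approach (a swapped-in technique, not something lmList does for you) is to run a separate test per species and then adjust the p-values. The sketch below uses Fisher's exact test on each species' 2x2 table and a Holm correction via `p.adjust()`. The data frame `df2` and helper `make_group` are hypothetical, and whether per-species tests are appropriate for the real data is an assumption.

```r
# Hypothetical helper: expand 2x2 cell counts into one row per observation.
make_group <- function(name, n00, n01, n10, n11) {
  data.frame(
    Species      = name,
    threat.count = rep(c(0, 0, 1, 1), times = c(n00, n01, n10, n11)),
    impact.count = rep(c(0, 1, 0, 1), times = c(n00, n01, n10, n11))
  )
}

df2 <- rbind(
  make_group("plant1", 10, 2, 2, 10),
  make_group("plant2", 6, 6, 6, 6),
  make_group("plant3", 3, 9, 9, 3)
)

by_species <- split(df2, df2$Species)

# One Fisher's exact test per species on its threat x impact table.
res <- do.call(rbind, lapply(names(by_species), function(sp) {
  d <- by_species[[sp]]
  # Fix the factor levels so the table is always 2x2, even with empty cells.
  tab <- table(factor(d$threat.count, levels = 0:1),
               factor(d$impact.count, levels = 0:1))
  ft <- fisher.test(tab)
  data.frame(Species    = sp,
             odds_ratio = unname(ft$estimate),
             p_value    = ft$p.value)
}))

# Holm adjustment controls the family-wise error rate across species.
res$p_holm <- p.adjust(res$p_value, method = "holm")
print(res)
```

Per-species tests say whether each association differs from zero; they do not by themselves say whether the associations differ from each other, which is what an interaction term addresses.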

Best Answer

I don't think the glm function works for this. For example, in this code:

mod2 <- glm(impact.count ~ 0 + threat.count + Species, data = df, 
            family = binomial(link = "logit"))

The coefficients here estimate the effect of Species on impact when we control for threat. They are not the correlation between threat and impact for any specific species.

You can use the scale function to standardize the variables within each species:

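library(tidyverse)

# Standardize both variables within each species.
# scale() returns a one-column matrix, so convert it to a plain vector.
# A species with no variation in a variable gets NaN here (sd = 0).
df_z_score <- df %>%
  group_by(Species) %>%
  mutate(threat_z = as.numeric(scale(threat.count)),
         impact_z = as.numeric(scale(impact.count))) %>%
  ungroup()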

Then

lm(threat_z ~ impact_z * factor(Species), data = df_z_score)

Because the regression coefficient between two standardized variables equals their correlation, the coefficient impact_z = -6.124e-01 is the correlation between impact and threat in the reference group (the first species).
The coefficient of each interaction term is the difference between that species' correlation and the reference group's correlation. Its p-value indicates whether that difference is significant.
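The claim that the standardized slopes are within-species correlations can be checked directly. The sketch below is a base R version of the same idea (using `ave()` rather than dplyr, which is a substitution on my part) on a hypothetical data frame `df2` whose within-species correlations are known by construction.

```r
# Hypothetical helper: expand 2x2 cell counts into one row per observation.
make_group <- function(name, n00, n01, n10, n11) {
  data.frame(
    Species      = name,
    threat.count = rep(c(0, 0, 1, 1), times = c(n00, n01, n10, n11)),
    impact.count = rep(c(0, 1, 0, 1), times = c(n00, n01, n10, n11))
  )
}

df2 <- rbind(
  make_group("plant1", 10, 2, 2, 10),
  make_group("plant2", 6, 6, 6, 6),
  make_group("plant3", 3, 9, 9, 3)
)
df2$Species <- factor(df2$Species)

# Standardize within each species (mean 0, sd 1 per group).
z <- function(x) (x - mean(x)) / sd(x)
df2$threat_z <- ave(df2$threat.count, df2$Species, FUN = z)
df2$impact_z <- ave(df2$impact.count, df2$Species, FUN = z)

fit <- lm(threat_z ~ impact_z * Species, data = df2)

# Slope in the reference species, and slope in plant3 (reference + interaction).
slope_plant1 <- unname(coef(fit)["impact_z"])
slope_plant3 <- unname(coef(fit)["impact_z"] +
                       coef(fit)["impact_z:Speciesplant3"])

# Correlations computed separately within each species, for comparison.
r_by_species <- sapply(split(df2, df2$Species),
                       function(d) cor(d$threat.count, d$impact.count))

print(r_by_species)
print(c(slope_plant1 = slope_plant1, slope_plant3 = slope_plant3))
```

The slopes reproduce the per-species correlations. The p-values from `lm` rest on normal-error assumptions that binary data do not satisfy, so they are probably best treated as approximate.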

Related Solutions

GLM – Removing Intercept from GLM for Multiple Factorial Predictors Only Works for First Factor in Model

That trick of getting a parameter for each level of the factor by removing the intercept only works when there is only one factor, as you have seen. You can understand why by counting degrees of freedom. Let factor $a$ have $a$ levels and factor $b$ have $b$ levels. Then factor $a$ has $a-1$ degrees of freedom: the indicator matrix with $a$ columns representing it, with a $1$ in each row for the level present in that row, adds only $a-1$ dimensions beyond the constant column. Likewise factor $b$ has $b-1$ degrees of freedom. The intercept has one degree of freedom. So the model formula $\sim a + b$ (which really is $\sim a + b + 1$) has $1 + (a-1) + (b-1) = a+b-1$ degrees of freedom. Removing the intercept (model formula $\sim a + b - 1$) represents the same model; only the parametrization changes. So it must also have $a + b - 1$ degrees of freedom. That $-1$ shows that there cannot be $a+b$ parameters, so one of the factors must still get one parameter fewer than its number of levels.

That explains what you have seen. You can still assign a coefficient to the missing level of $b$: it is simply zero (depending on the contrasts you are using).
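The parameter count can be checked quickly in R. This is a small sketch using two three-level factors (so $a + b - 1 = 5$) and the default treatment contrasts; the object names are arbitrary.

```r
# Two three-level factors in a fully crossed 3 x 3 layout.
a <- factor(rep(letters[1:3], 3))
b <- factor(rep(letters[1:3], each = 3))

# Design matrices with and without the intercept.
X_with    <- model.matrix(~ a + b)
X_without <- model.matrix(~ a + b - 1)

colnames(X_with)     # "(Intercept)" "ab" "ac" "bb" "bc"
colnames(X_without)  # "aa" "ab" "ac" "bb" "bc"

# Both parametrizations have the same number of columns and the same rank.
rank_with    <- qr(X_with)$rank
rank_without <- qr(X_without)$rank
print(c(rank_with = rank_with, rank_without = rank_without))
```

Dropping the intercept gives factor `a` all three of its levels, but factor `b` still loses its first level, so the total stays at five parameters.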

To make this a bit more explicit, let us look at an example. I will use R for the matrix algebra. To make design matrices (in R parlance "model matrices") from factors, we need to define contrast functions. I use the default:

> options("contrasts")
$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly"

First we make two factors for a simple, fully crossed design:

a  <- factor(rep(letters[1:3], 3))
b  <- factor(rep(letters[1:3], each=3))

Then design matrices for each of them:

> X1 <- model.matrix( ~ a-1)
> X2 <- model.matrix( ~b-1)
> X1
  aa ab ac
1  1  0  0
2  0  1  0
3  0  0  1
4  1  0  0
5  0  1  0
6  0  0  1
7  1  0  0
8  0  1  0
9  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.treatment"

> X2
  ba bb bc
1  1  0  0
2  1  0  0
3  1  0  0
4  0  1  0
5  0  1  0
6  0  1  0
7  0  0  1
8  0  0  1
9  0  0  1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$b
[1] "contr.treatment"

Each of them, separately, is of full rank:

library(MASS)
library(Matrix)  
> Matrix::rankMatrix(X1)
[1] 3
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.998401e-15
> Matrix::rankMatrix(X2)
[1] 3
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.998401e-15

But when combined there is a rank deficit, so they must have one dimension "in common":

rankMatrix(cbind(X1, X2))
[1] 5
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.998401e-15

To identify the common dimension we use the Null() function from package MASS, calculating the null space:

 Null(t(cbind(X1, X2)))
           [,1]
[1,] -0.4082483
[2,] -0.4082483
[3,] -0.4082483
[4,]  0.4082483
[5,]  0.4082483
[6,]  0.4082483

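Yes, the common dimension is the constant vector: the null vector above says that the sum of the columns of X1 equals the sum of the columns of X2, and each sum is a column of ones.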
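The same conclusion can be verified without MASS. This is a base R sketch; the vector `v` is the null-space direction rescaled to plus and minus one.

```r
a <- factor(rep(letters[1:3], 3))
b <- factor(rep(letters[1:3], each = 3))

X1 <- model.matrix(~ a - 1)
X2 <- model.matrix(~ b - 1)

# Each row has exactly one 1, so the columns of each matrix sum to a vector of ones.
sum_X1 <- rowSums(X1)
sum_X2 <- rowSums(X2)

# The null-space direction: minus the columns of X1 plus the columns of X2.
combined <- cbind(X1, X2)
v <- c(-1, -1, -1, 1, 1, 1)
residual <- combined %*% v   # all zeros, so the six columns are linearly dependent

# Six columns, but only five independent directions.
combined_rank <- qr(combined)$rank
print(combined_rank)
```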

Logistic Regression R – Modeling Binary Outcome and Predictor in Different Categories

I think you actually need different syntax for your model. Here is the same model with a slightly different parameterization:

set.seed(100)
dat <- data.frame(Species = rep(letters[1:10], each = 5),
                  threat_cat = rep(c("recreation", "climate", "pollution",
                                     "fire", "invasive_spp"), 10),
                  impact.pres = sample(0:1, size = 50, replace = TRUE),
                  threat.pres = sample(0:1, size = 50, replace = TRUE))

mod <- glm(impact.pres ~ 0 + threat_cat/threat.pres, 
           data = dat, family = "binomial")
summary(mod)
#> 
#> Call:
#> glm(formula = impact.pres ~ 0 + threat_cat/threat.pres, family = "binomial", 
#>     data = dat)
#> 
#> Coefficients:
#>                                      Estimate Std. Error z value Pr(>|z|)
#> threat_catclimate                   5.108e-01  7.303e-01   0.699    0.484
#> threat_catfire                      1.609e+00  1.095e+00   1.469    0.142
#> threat_catinvasive_spp             -1.386e+00  1.118e+00  -1.240    0.215
#> threat_catpollution                 1.386e+00  1.118e+00   1.240    0.215
#> threat_catrecreation               -1.386e+00  1.118e+00  -1.240    0.215
#> threat_catclimate:threat.pres      -5.108e-01  1.592e+00  -0.321    0.748
#> threat_catfire:threat.pres         -2.018e+01  3.261e+03  -0.006    0.995
#> threat_catinvasive_spp:threat.pres  1.792e+00  1.443e+00   1.241    0.214
#> threat_catpollution:threat.pres     8.770e-16  1.581e+00   0.000    1.000
#> threat_catrecreation:threat.pres    1.995e+01  2.917e+03   0.007    0.995
#> 
#>     Null deviance: 69.315  on 50  degrees of freedom
#> Residual deviance: 45.511  on 40  degrees of freedom
#> AIC: 65.511
#>

Created on 2022-03-02 by the reprex package (v2.0.1)

(Note that because you did not set a seed the results will differ slightly.)

With this parameterization, you can interpret the "main effect" of each category of the categorical variable as the log odds of impact.pres for that category when threat.pres is 0. For example, the coefficient on threat_catclimate (.5108) means that the log odds of impact.pres for the climate group of threat_cat is .5108 when threat.pres is 0, or, the odds of impact.pres for the climate group of threat_cat is $\exp(.5108)=1.67$ when threat.pres is 0.

You can interpret the interaction term between threat.pres and each level of threat_cat as the difference in the log odds of impact.pres when threat.pres is 1 vs. when threat.pres is 0 for that group of threat_cat. For example, the coefficient on threat_catclimate:threat.pres (-.5108) means the log odds of impact.pres for the climate group of threat_cat is .5108 lower when threat.pres is 1 than when threat.pres is 0.

From this model (but not the standard parameterization or your attempt), it is easy to see the "effect" of the binary variable at each level of the categorical variable without requiring a reference category.
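To illustrate that the nested parameterization is the same model as the usual interaction model, only re-expressed, here is a small sketch on made-up data (the data frame `dat2` and helper `make_cat` are hypothetical, with cell counts chosen to avoid the empty cells that produce the huge standard errors seen above). It also checks that each nested slope is the log odds ratio of that category's 2x2 table.

```r
# Hypothetical helper: expand 2x2 cell counts into one row per observation.
# n00 = threat 0 / impact 0, n01 = threat 0 / impact 1, and so on.
make_cat <- function(name, n00, n01, n10, n11) {
  data.frame(
    threat_cat  = name,
    threat.pres = rep(c(0, 0, 1, 1), times = c(n00, n01, n10, n11)),
    impact.pres = rep(c(0, 1, 0, 1), times = c(n00, n01, n10, n11))
  )
}

dat2 <- rbind(
  make_cat("climate",   4, 6, 5, 5),
  make_cat("fire",      2, 8, 7, 3),
  make_cat("pollution", 6, 4, 3, 7)
)

# Nested parameterization: one baseline and one threat slope per category.
nested <- glm(impact.pres ~ 0 + threat_cat / threat.pres,
              data = dat2, family = binomial)

# Standard parameterization: reference category plus differences.
crossed <- glm(impact.pres ~ threat_cat * threat.pres,
               data = dat2, family = binomial)

# Same fit, different coefficients.
print(c(nested = deviance(nested), crossed = deviance(crossed)))

# Threat effect for "fire", read off directly from the nested model ...
fire_nested <- unname(coef(nested)["threat_catfire:threat.pres"])

# ... and rebuilt from the crossed model as reference slope + interaction.
fire_crossed <- unname(coef(crossed)["threat.pres"] +
                       coef(crossed)["threat_catfire:threat.pres"])

# Log odds ratio from the fire 2x2 table: (2 * 3) / (8 * 7).
fire_log_or <- log((2 * 3) / (8 * 7))
print(c(fire_nested = fire_nested, fire_crossed = fire_crossed,
        fire_log_or = fire_log_or))
```

The nested form saves the addition step, which is the convenience described above. Testing whether the slopes differ between categories still requires comparing against a model with a single common slope.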
