R – How to Ask if Correlation Between Two Binary Variables Varies Between Groups?

categorical datacorrelationinterpretationlogisticr

This seems like a simple coding/statistics problem, but I've been working on this and reading about it for days, and I just can not seem to wrap my head around it … I am a biologist, not a statistician, and any help would be appreciated.

I am trying to find a way to ask if the degree of correlation or relationship between two binary variables (impacts present/absent ~ threats present/absent) varies between species, and separately, between categories. Suggestions on better approaches/packages/ways to code what I'm after would be appreciated, as well as general input on what I've already done. My dataframe looks like this:

set.seed(123)
df <- data.frame(Species = rep(c("plant1", "plant2", "plant3", 
       "plant4", "plant5"), each=5), Category= rep(c("A", "B", 
       "C", "D","E"), 5), threat.count = sample(0:1, replace = T, 
       size = 25), impact.count = sample(0:1, replace = T, 
       size = 25)) 

I can ask what the general correlation between threats and impacts is with a non-parametric Spearman test for correlation between paired samples:

cor.test(df$threat.count, df$impact.count, method = "spearman", 
    exact = FALSE, conf.int=TRUE)  
# rho = -0.1666667; in my real data the correlation is much 
# higher, around 0.26. 

I would interpret this as: Overall, there is a 26% correlation between threats and impacts.

However, I would like to dig in and ask if the degree of correlation varies between Species (and later between Categories), and if so, how (e.g. is the correlation between threats and impacts stronger for some species than for others?).

I have tried creating both generalized linear models, and generalized linear mixed models to get at this, and am not sure if either answers my questions and if I am interpreting them correctly.

To ask if degree of correlation varies between species overall, I could do something like this:

mod0 <- glm(impact.count ~ threat.count, data = df, family = 
            binomial(link = "logit"))
mod1 <- glm(impact.count ~ threat.count + Species, data = df, 
            family = binomial(link = "logit"))
anova(mod0, mod1, test = 'LRT') # here, no, but in my real data, 
                                # yes

#all I can say from that would be 'Yes/no, the degree of correlation between threats and impacts varies between species'… but I would like to know how much?

So, we can look at the summary from mod1:

summary(mod1)

As I understand it, in this output, the coefficient estimate for threat.count is the log-odds of an impact being present (impact.count = 1) with a 1-unit change in threat.count (aka if threat count = 1). The coefficient estimates for Species2-Species5 are the difference in log-odds between their respective coefficients and the coefficient for Species1 (the Intercept term). I could get the "true" coefficients for all categories by running a no-intercept model like this:

mod2 <- glm(impact.count ~ 0 + threat.count + Species, data = df, 
            family = binomial(link = "logit"))
summary(mod2)

The coefficient estimates for Species1:Species5 here are the log-odds that an impact will be present (impact.count = 1) for each Species if threat count is … constant(?) 1(?) mean= 0.5(?).

I could get the odds ratios and probabilities for either version with:

exp(coef(modx))                     #get odds ratios 
exp(coef(modx))/(1+exp(coef(modx))) #get probabilities

My issue here is, how do I/can I interpret any of these in terms of whether/how much the correlation between threats and impacts varies between species?

I have also tried making a generalized linear mixed model:

library(lme4)
mod3 <- glmer(impact.count ~ threat.count + (1|Species), data = 
              df, family = binomial(link = "logit"))
summary(mod3)

This gives me an estimate for threat.count, but I am running into the same interpretation problem as before.

I also tried using lmList to look at the relationship between impact count and threat count separately for each species, but am worried about whether this is a statistically sound approach… any multiple comparisons issues? Also, how would I get it to spit out whether the sub-models are significant?

corr.spp.list <- lme4::lmList(impact.count ~ threat.count 
                 |Species, data = df, 
                 family = binomial(link = "logit"), 
                 warn = TRUE) #fitting each model separately by 
                              # species
corr.spp.list

Best Answer

I don't think glm function work. For example, in this code:

mod2 <- glm(impact.count ~ 0 + threat.count + Species, data = df, 
            family = binomial(link = "logit"))

The coefficients actually estimate the effect of Species on impact when we controls for threat. It's not the correlation between count and impact for some specific species.

You can use scale function to implement variable standardization for each specie:

library(tidyverse)

df_z_score <- df %>%
  group_by(Species) %>%
  mutate(threat_z = scale(threat.count),
         impact_z = scale(impact.count))

Then

lm(threat_z ~ impact_z + factor(Species)*impact_z, data = 
              df_z_score)

Because the regression coefficient of the standardized variable is equal to the correlation , the coefficient impact_z = -6.124e-01 is actually the correlation between impact and threat in the reference group.
The coefficient of interaction term is the change of correlation coefficient relative to the reference group. P-value indicates whether the change is significant.

Related Question