Logistic Regression R – Modeling Binary Outcome and Predictor in Different Categories

categorical-data, interpretation, logistic

I could use some advice. I am trying to model the relationship between a binary outcome (impact present/impact absent) and a binary predictor (threat present/threat absent), and see whether that relationship varies among threat categories. In other words, does the correlation between threats and impacts vary significantly among threat categories? Since I am modeling a binary outcome, logistic regression seems like a reasonable approach, and I describe what I have so far below. Suggestions for other approaches are welcome. My end goal is to be able to say, for each threat category, that the likelihood of an impact being present when a threat is present is x.

My dataframe looks like this:
species = a categorical variable with 128 levels
threat_category = a categorical variable with 17 levels
impact.pres = a binary variable with present = 1 and not_present = 0
threat.pres = a binary variable with present = 1 and not_present = 0

Example data (smaller than actual dataset):

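# data.frame() rather than cbind(): cbind() would return a character matrix,
# which glm() cannot use as its data argument
dat <- data.frame(Species = rep(letters[1:10], each = 5),
                  threat_cat = rep(c("recreation", "climate", "pollution", "fire", "invasive_spp"), 10),
                  impact.pres = sample(0:1, size = 50, replace = TRUE),
                  threat.pres = sample(0:1, size = 50, replace = TRUE))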

I am running a no-intercept model because I am interested in the true coefficients for each threat, not the difference between each threat and a reference threat.

My model and output look something like this:

mod <- glm(impact.pres ~ 0 + threat.pres*threat_cat, data = dat, family = "binomial")
summary(mod)

Call:
glm(formula = impact.pres ~ 0 + threat.pres * threat_cat, family = "binomial", 
    data = dat)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.66511  -0.90052  -0.00022   0.90052   1.89302  

Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)
threat.pres                        -6.931e-01  1.323e+00  -0.524    0.600
threat_catclimate                   6.931e-01  8.660e-01   0.800    0.423
threat_catfire                      6.513e-16  1.000e+00   0.000    1.000
threat_catinvasive_spp              1.099e+00  1.155e+00   0.951    0.341
threat_catpollution                -9.163e-01  8.367e-01  -1.095    0.273
threat_catrecreation                6.931e-01  8.660e-01   0.800    0.423
threat.pres:threat_catfire         -9.163e-01  1.987e+00  -0.461    0.645
threat.pres:threat_catinvasive_spp -1.099e+00  1.958e+00  -0.561    0.575
threat.pres:threat_catpollution    -1.596e+01  2.284e+03  -0.007    0.994
threat.pres:threat_catrecreation    1.099e+00  1.958e+00   0.561    0.575

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 69.315  on 50  degrees of freedom
Residual deviance: 56.785  on 40  degrees of freedom
AIC: 76.785

Number of Fisher Scoring iterations: 16

In my real data, threat.pres and most of the individual impacts are significant, but none of the interaction terms are. I am wondering if I have specified the model correctly to be able to answer my question, and if so, how best to interpret these coefficients, and the significance levels associated with them.

Thank you for your time.

Best Answer

I think you actually need different syntax for your model. Here is the same model with a slightly different parameterization:

set.seed(100)
dat <- data.frame(Species = rep(letters[1:10], each = 5),
                  threat_cat = rep(c("recreation", "climate", "pollution", "fire", "invasive_spp"), 10),
                  impact.pres = sample(0:1, size = 50, replace = T),
                  threat.pres = sample(0:1, size = 50, replace = T))

mod <- glm(impact.pres ~ 0 + threat_cat/threat.pres, 
           data = dat, family = "binomial")
summary(mod)
#> 
#> Call:
#> glm(formula = impact.pres ~ 0 + threat_cat/threat.pres, family = "binomial", 
#>     data = dat)
#> 
#> Coefficients:
#>                                      Estimate Std. Error z value Pr(>|z|)
#> threat_catclimate                   5.108e-01  7.303e-01   0.699    0.484
#> threat_catfire                      1.609e+00  1.095e+00   1.469    0.142
#> threat_catinvasive_spp             -1.386e+00  1.118e+00  -1.240    0.215
#> threat_catpollution                 1.386e+00  1.118e+00   1.240    0.215
#> threat_catrecreation               -1.386e+00  1.118e+00  -1.240    0.215
#> threat_catclimate:threat.pres      -5.108e-01  1.592e+00  -0.321    0.748
#> threat_catfire:threat.pres         -2.018e+01  3.261e+03  -0.006    0.995
#> threat_catinvasive_spp:threat.pres  1.792e+00  1.443e+00   1.241    0.214
#> threat_catpollution:threat.pres     8.770e-16  1.581e+00   0.000    1.000
#> threat_catrecreation:threat.pres    1.995e+01  2.917e+03   0.007    0.995
#> 
#>     Null deviance: 69.315  on 50  degrees of freedom
#> Residual deviance: 45.511  on 40  degrees of freedom
#> AIC: 65.511
#>

Created on 2022-03-02 by the reprex package (v2.0.1)

(Note that because you did not set a seed the results will differ slightly.)

With this parameterization, you can interpret the "main effect" of each category of the categorical variable as the log odds of impact.pres for that category when threat.pres is 0. For example, the coefficient on threat_catclimate (.5108) means that the log odds of impact.pres for the climate group of threat_cat is .5108 when threat.pres is 0, or, the odds of impact.pres for the climate group of threat_cat is $\exp(.5108)=1.67$ when threat.pres is 0.

You can interpret the interaction term between threat.pres and each level of threat_cat as the difference in the log odds of impact.pres when threat.pres is 1 vs. when threat.pres is 0 for that group of threat_cat. For example, the coefficient on threat_catclimate:threat.pres (-.5108) means the log odds of impact.pres for the climate group of threat_cat is .5108 lower when threat.pres is 1 than when threat.pres is 0. Adding the two climate coefficients gives the log odds when threat.pres is 1: $.5108 - .5108 = 0$, i.e. odds of $\exp(0)=1$ and a probability of 0.5.
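To get from these coefficients to the quantity asked for in the question (the probability of an impact when a threat is present, per category), the two coefficients for a category can be summed and passed through the inverse logit. The following is a minimal sketch, assuming the same simulated data and model as above; `newdat`, `b`, and `p_manual` are names made up for this illustration. With cells this small some estimates are unstable (note the very large standard errors in the output), so the numbers are only illustrative.

```r
set.seed(100)
dat <- data.frame(Species = rep(letters[1:10], each = 5),
                  threat_cat = rep(c("recreation", "climate", "pollution", "fire", "invasive_spp"), 10),
                  impact.pres = sample(0:1, size = 50, replace = TRUE),
                  threat.pres = sample(0:1, size = 50, replace = TRUE))

# glm() may warn about fitted probabilities of 0 or 1 with such small cells
mod <- suppressWarnings(
  glm(impact.pres ~ 0 + threat_cat/threat.pres, data = dat, family = "binomial")
)

# One row per category, with the threat present
newdat <- data.frame(threat_cat = sort(unique(dat$threat_cat)), threat.pres = 1)
newdat$p_impact <- unname(predict(mod, newdata = newdat, type = "response"))

# The same by hand: inverse logit of (category coefficient + category:threat coefficient)
b <- coef(mod)
p_manual <- plogis(b[paste0("threat_cat", newdat$threat_cat)] +
                   b[paste0("threat_cat", newdat$threat_cat, ":threat.pres")])

newdat
```

Setting `threat.pres = 0` in `newdat` instead gives the probability of an impact when the threat is absent, which is the inverse logit of the category coefficient alone.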

From this model (but not the standard parameterization or your attempt), it is easy to see the "effect" of the binary variable at each level of the categorical variable without requiring a reference category.
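The question also asks whether the threat-impact relationship differs among categories as a whole, which the individual p-values on each coefficient do not answer directly. One common way to check this, which is not part of the answer above, is a likelihood-ratio test comparing a model with a single shared threat.pres effect against the category-specific model. The sketch below assumes the same simulated data; `mod_common`, `mod_nested`, and `lrt` are names made up here.

```r
set.seed(100)
dat <- data.frame(Species = rep(letters[1:10], each = 5),
                  threat_cat = rep(c("recreation", "climate", "pollution", "fire", "invasive_spp"), 10),
                  impact.pres = sample(0:1, size = 50, replace = TRUE),
                  threat.pres = sample(0:1, size = 50, replace = TRUE))

# One shared effect of threat.pres across all categories
mod_common <- suppressWarnings(
  glm(impact.pres ~ 0 + threat_cat + threat.pres, data = dat, family = "binomial")
)

# A separate effect of threat.pres within each category
mod_nested <- suppressWarnings(
  glm(impact.pres ~ 0 + threat_cat/threat.pres, data = dat, family = "binomial")
)

# Likelihood-ratio (deviance) test of the shared-effect model against the nested one
lrt <- anova(mod_common, mod_nested, test = "Chisq")
lrt
```

A small p-value in the second row would suggest that the effect of threat.pres is not the same in every category. The test has (number of categories - 1) degrees of freedom, which is 4 for the five categories in this example.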

Related Solutions

Solved – Correlation among categories between categorical nominal variables

The "focal" association between category $i$ of one nominal variable and category $j$ of the other is expressed by the frequency residual in cell $ij$. If the residual is 0, the frequency is exactly what is expected when the two nominal variables are not associated. The larger the residual, the stronger the association due to the overrepresented combination $ij$ in the sample. A large negative residual likewise indicates an underrepresented combination. So the frequency residual is what you want.

Raw residuals are not suitable, though, because they depend on the marginal totals, the overall total, and the table size: the value is not standardized in any way. SPSS can also display standardized residuals, also called Pearson residuals. A st. residual is the residual divided by an estimate of its standard deviation (equal to the sq. root of the expected value). St. residuals of a table have mean 0 and st. dev. 1; therefore, a st. residual serves as a z-value, like a z-value in the distribution of a quantitative variable (actually, it is z in a Poisson distribution). St. residuals are comparable between different tables of the same size and the same total $N$. The chi-square statistic of a contingency table is the sum of the squared st. residuals in it. Comparing st. residuals within a table and across tables of the same size and total helps identify the particular cells that contribute most to the chi-square statistic.

SPSS also displays adjusted residuals (= adjusted standardized residuals). An adj. residual is the residual divided by an estimate of its standard error. Interestingly, the adj. residual is exactly equal to $\sqrt{N}r_{ij}$, where $N$ is the grand total and $r_{ij}$ is the Pearson correlation (alias Phi correlation) between the dummy variables corresponding to categories $i$ and $j$ of the two nominal variables. This $r$ is exactly what you say you want to compute, and the adj. residual is directly related to it.

Unlike the st. residual, the adj. residual is also standardized with respect to the shape of the marginal distributions in the table (it takes into account the expected frequency not only in that cell but also in the cells outside its row and its column), so you can directly see the strength of the tie between categories $i$ and $j$ without worrying about whether their marginal totals are big or small relative to the other categories'. The adj. residual is also like a z-score, but now it is like z of a normal (not Poisson) distribution. If an adj. residual is above 2 or below -2 you may conclude it is significant at the p<0.05 level$^1$. Adj. residuals are still affected by $N$; $r$'s are not, but you can obtain all the $r$s from the adj. residuals, following the above formula, without spending time producing dummy variables.$^2$
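For readers working in R rather than SPSS, the same quantities are available from `chisq.test()`: its `residuals` component holds the Pearson residuals, and `stdres` holds what SPSS calls adjusted residuals. The sketch below uses a made-up 3x3 table (all counts and labels are hypothetical) and checks the $\sqrt{N}r_{ij}$ relation numerically for one cell.

```r
# Hypothetical 3x3 contingency table of two nominal variables
tab <- matrix(c(20, 10,  5,
                10, 15, 20,
                 5, 10, 25),
              nrow = 3, byrow = TRUE,
              dimnames = list(row = c("a", "b", "c"), col = c("x", "y", "z")))

res <- chisq.test(tab)
pearson  <- res$residuals  # (observed - expected) / sqrt(expected)
adjusted <- res$stdres     # residual divided by its estimated standard error
N <- sum(tab)

# Expand the table to one row per observation so dummy variables can be built
long <- as.data.frame(as.table(tab))
raw  <- long[rep(seq_len(nrow(long)), long$Freq), ]

# Phi correlation between the dummy for row category "a" and column category "x"
r_ax <- cor(as.integer(raw$row == "a"), as.integer(raw$col == "x"))

c(adjusted_residual = adjusted["a", "x"], sqrt_N_times_r = sqrt(N) * r_ax)

# Chi-square statistic as the sum of squared Pearson residuals
c(chi_square = unname(res$statistic), sum_sq_pearson = sum(pearson^2))
```

Each of the two printed pairs should agree, which illustrates both the adjusted-residual identity and the decomposition of chi-square into squared Pearson residuals.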

In regard to your second question, about 3-way category ties - this is possible as part of the general loglinear analysis which also displays residuals. However, practical use of 3-way cell residuals is modest: 3(+)-way association measures are not easily standardized and are not easily interpretable.

$^1$ In st. normal curve $1.96 \approx 2$ is the cut-point of 2.5% tail, so 5% if you consider both tails as with 2-sided alternative hypothesis.

$^2$ It follows that the significance of the adjusted residual in cell $ij$ equals the significance of $r_{ij}$. Besides, if there are only 2 columns in the table and you perform a z-test of proportions between $\text {Pr}(i,1)$ and $\text {Pr}(i,2)$, the column proportions for row $i$, the p-value of that test equals the significance of both (either) adj. residuals in row $i$ of the 2-column table.

R – How to Ask if Correlation Between Two Binary Variables Varies Between Groups?

I don't think the glm function will work here. For example, in this code:

mod2 <- glm(impact.count ~ 0 + threat.count + Species, data = df, 
            family = binomial(link = "logit"))

The Species coefficients estimate the effect of Species on impact while controlling for threat. They are not the correlation between threat and impact within a specific species.

You can use the scale function to standardize the variables within each species:

library(tidyverse)

# scale() returns a one-column matrix, so convert to a plain numeric vector
df_z_score <- df %>%
  group_by(Species) %>%
  mutate(threat_z = as.numeric(scale(threat.count)),
         impact_z = as.numeric(scale(impact.count)))

Then

lm(threat_z ~ impact_z + factor(Species)*impact_z, data = df_z_score)

Because the regression coefficient between standardized variables equals the correlation, the coefficient on impact_z (-6.124e-01) is the correlation between impact and threat in the reference group.
The coefficient on each interaction term is the change in the correlation relative to the reference group, and its p-value indicates whether that change is significant.
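The claim that a regression slope between standardized variables equals the correlation can be checked on toy data. The sketch below uses simulated numbers and a single group (not the `df` from the question), and all variable names are made up for the illustration.

```r
set.seed(1)
threat <- rnorm(30)
impact <- 0.5 * threat + rnorm(30)

# Standardize both variables to mean 0 and standard deviation 1
threat_z <- as.numeric(scale(threat))
impact_z <- as.numeric(scale(impact))

# Slope of the regression between the standardized variables
slope <- unname(coef(lm(threat_z ~ impact_z))["impact_z"])

# Pearson correlation of the original variables
r <- cor(threat, impact)

c(slope = slope, correlation = r)
```

The two printed values should match up to floating-point error. With the interaction model above, the same logic applies within each species because the variables were standardized per group.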
