Solved – Multivariable logistic regression and Anova

anovalogisticmultiple regressionr

Tl;dr: I highlighted my primary questions at the bottom of the post.

I'm having difficulty with the workflow for a multiple logistic regression for my (genomic) data and I'm starting to think this comes from a fundamental misunderstanding of GLMs.

To elaborate, I have a series of proportions for a large number of gene features which can (essentially) be divided into respective "successes" (Cnt1) and "failures" (Cnt2) (see mock dataframe).

Pair  Cnt1  Cnt2 Gene Tissue
A1:B1   23   201   A   T1
A1:B1   43   565   B   T1
A1:B1   65   123   A   T2
A1:B1   76   678   B   T2
A1:B1   63   675   A   T3
A1:B1   77   618   B   T3
A2:B2   34   143   A   T1
...

Foremost, I'd like to see how pair status (Gene A vs Gene B) influences the proportion of Cnt (e.g. pr(Cnt1)).

For this, I reasoned this model would suffice:

pr(Cnt1) ~ Gene

or in R syntax,

glm(cbind(Cnt2,Cnt1) ~ Gene, family= binomial(link="logit"), data = x)

However, I need to identify pairs that exhibit strong differences in pr(Cnt1) between Gene A and Gene B but also differ between T1, T2, or T3.

I've tried to look at the interaction between Gene:Tissue (see code below) but this where I start to get lost conceptually.

glm(cbind(Cnt2,Cnt1) ~ Gene*Tissue, family= binomial(link="logit"), data = x)

An example summary of the coefficients from one pair:

                             Estimate Std. Error    z value      Pr(>|z|)
(Intercept)                3.42965051  0.1122062 30.5656166 3.506965e-205
GeneB                      0.14472442  0.1356404  1.0669714  2.859848e-01
tissueT2                  -0.18315952  0.1963746 -0.9327047  3.509724e-01
tissueT3                   0.06658153  0.1481779  0.4493351  6.531900e-01
GeneB:tissueT2            -0.27579716  0.2330609 -1.1833697  2.366627e-01
GeneB:tissueT3            -0.87463736  0.3661340 -2.3888451  1.690143e-02

It was recommended that I run an Anova (i.e. via "car") on the fit for each pair. This is where whatever tenuous understanding I have falls apart. Why do I need to run the anova on these models? I'd guess it's to test the fit of the model but is it truly essential in this scenario?

The anova output reduces all the predictor variable interactions to simply the Gene:Tissue interaction.

Given the fact that I'm (presumably) interested in pairs for which the variation in pr(Cnt1) is explained by an interaction between Gene and at least one of the tissues, could I just use the glm output?

Additionally, why is the intercept so low?

Overall questions

(1) Why should one run an anova after fitting a logistic regression and what is it's purpose in this context?

(2) Can I use the output of the model summary, and additionally, what does the extremely low intercept in this example mean?

I understand that is an incredibly convoluted question but some help and/or (if possible) resources that might answers would be greatly appreciated!

Best Answer

There are a couple of approaches to your situation. You are trying to regress a binary response on a two way interaction. If you have no continuous variables you would like to throw in, then one option is to perform a linear regression. If you understand the linear regression model, then this solution will be easier for you. It will also give the same predictions as the logistic regression model as long as the regressor remains the two-way interaction. You would reshape your data thus:

Pair  Cnt1 Gene Tissue Weights
A1:B1    1    A     T1      23
A1:B1    0    A     T1     201
A1:B1    1    B     T1      43
A1:B1    0    B     T1     565

Hence each current row becomes two rows and a weights variable represents the number of individuals satisfying that condition, so frequency weights. Then you can perform your linear regression model (this alternative shape for the data also works for logistic regression). Your coefficients will be on the probability scale and signify differences in probabilities. Your syntax could be:

lm(Cnt1 ~ Gene * Tissue, data = x, weights = Weights)

Personally, I prefer:

lm(Cnt1 ~ 0 + Gene + Gene : Tissue, data = x, weights = Weights)
# this tells me the proportion for each gene at T1, the baseline then
# how T2 and T3 differ from this baseline for each gene

If your model will expand to include continuous predictors, or you may drop the interaction between both factors, then a logistic model remains a good bet. With continuous predictors or no interaction between both factors, a linear model may predict garbage probabilities (outside of 0 and 1) and then the model may be both biased (incorrect coefficients) and inconsistent (situation not helped by large sample size).

If you stick with the logistic model. Since you are intent on modeling pr(Cnt1), then your syntax should be:

# cbind(success, failure)
glm(cbind(Cnt1, Cnt2) ~ Gene * Tissue, family = binomial, data = x) # default link for binomial is logit

The linear model component here, made up of your coefficients, is not on the probability scale but on the logit or log odds scale. Your intercept should be the negative of what you currently have. And that will be a huge number since it's on the log odds scale. (Remember you are modeling 1/0 data not 23/635 data). I think this answer does a good job of explaining how to work with logistic regression coefficients.

Finally, about the anova function, see ?anova.glm for details. This is actually analysis of deviance (deviance being bad) which is how we typically compare generalized linear models. What it does is it looks as the reduction in deviance as each term in your formula gets added to the model, so like a test for each new term during forward entry. The reduction for the interaction, being the last term added to the model, works like an omnibus test for all the terms representing the interaction. If the reduction is not statistically significant, some investigators may choose to simplify the model by dropping the interaction.

Related Solutions

R Programming – How to Use ANOVA and Linear Regression Together

this looks like that actually the intercepts are compared and not the slopes?

Your confusion there relates to the fact that you must be very careful to be clear about which intercepts and slopes you mean (intercept of what? slope of what?).

The role of a coefficient of a 0-1 dummy in a regression can be thought of both as a slope and as a difference of intercepts, simply by changing how you think about the model.

Let's simplify things as far as possible, by considering a two-sample case.

We can still do one-way ANOVA with two samples but it turns out to essentially be the same as a two-tailed two sample t-test (the equal variance case).

Here's a diagram of the population situation:

two group means as regression, population situation

If $\delta = \mu_2-\mu_1$, then the population linear model is

$y = \mu_1 + \delta x + e$

so that when $x=0$ (which is the case when we're in group 1), the mean of $y$ is $\mu_1 + \delta \times 0 = \mu_1$ and when $x=1$ (when we're in group 2), the mean of $y$ is $\mu_1 + \delta \times 1 = \mu_1 + \mu_2 - \mu_1 = \mu_2$.

That is the coefficient of the slope ($\delta$ in this case) and the difference in means (and you might think of those means as intercepts) is the same quantity.

$ $

To help with concreteness, here are two samples:

Group1:  9.5  9.8 11.8
Group2: 11.0 13.4 12.5 13.9

How do they look?

sample plot

What does the test of difference in means look like?

As a t-test:

    Two Sample t-test

data:  values by group
t = -5.0375, df = 5, p-value = 0.003976
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.530882 -1.469118
sample estimates:
mean in group g1 mean in group g2 
             9.9             12.9

As a regression:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.9000     0.4502  21.991 3.61e-06 ***
groupg2       3.0000     0.5955   5.037  0.00398 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7797 on 5 degrees of freedom
Multiple R-squared:  0.8354,    Adjusted R-squared:  0.8025 
F-statistic: 25.38 on 1 and 5 DF,  p-value: 0.003976

We can see in the regression that the intercept term is the mean of group 1, and the groupg2 coefficient ('slope' coefficient) is the difference in group means. Meanwhile the p-value for the regression is the same as the p-value for the t-test (0.003976)

Solved – Different p values for anova and summary in lme

This can happen when you build your model sequentially as you did. in this case, you should report the results of the AIC/LRT analyses (anova(baseline, GroupM, ReligionM, Group_Rel)) and then build a simple model using only 'significant' variables from that initial analysis from which to make inference about parameter values.

ReligionM2 <-lme(Ratings ~ Religion, random = ~1|PID/Religion/Stimuli, data = newDframe, method = "ML)

summary(ReligionM2)

That should reconcile the P-values from the overall model comparison approach and the P-values for the individual parameters.

It's interesting that you have religion as both a fixed effect and a random effect.

Best Answer

Related Solutions

R Programming – How to Use ANOVA and Linear Regression Together

Solved – Different p values for anova and summary in lme

Related Question