I have created a DAG using DAGitty, and from this DAG, two variables need to be controlled for to estimate the unbiased total effect of the exposure on the outcome. However, I'm confused about whether I should include only the exposure, the outcome, and the two variables that appear to be confounders in my logistic regression model. I understand that "the minimally sufficient adjustment set is the list of DAG elements that require adjustment (e.g., using regression, matching, or weighting) in order to accurately estimate the magnitude of the relationship between an exposure and an outcome". In this case, can I "adjust" for these two variables (sex and age) by including them in the model along with my exposure and outcome?
Thanks!
Logistic Regression – Variables to Include After Obtaining a Minimally Sufficient Set from a DAG
causality, dag, epidemiology, logistic
Related Solutions
Let's start by defining some terms. Bias is the average distance between the estimates produced by an estimator across many repeated samples and the true parameter. A biased estimate is an estimate coming from an estimator that has nonzero bias. Variance is the typical distance of each estimate from the average of those estimates across many repeated samples. Error is the distance between a single estimate and the true parameter. The average error of an estimator (across many repeated samples) is a combination of its bias and its variance.
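In symbols, for an estimator $\hat\theta$ of a true parameter $\theta$, with mean squared error as the measure of average (squared) error:

$$
\mathrm{Bias}(\hat\theta) = \mathrm{E}[\hat\theta] - \theta,\qquad
\mathrm{Var}(\hat\theta) = \mathrm{E}\!\left[(\hat\theta - \mathrm{E}[\hat\theta])^2\right],\qquad
\mathrm{MSE}(\hat\theta) = \mathrm{E}\!\left[(\hat\theta - \theta)^2\right] = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta).
$$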
You want an effect estimate that is unbiased and has low variance (i.e., high precision). This ensures that the estimate in your sample has as little error as possible and that your confidence interval for the effect is as narrow as possible while remaining at the nominal level of coverage. If your DAG is true, finding the minimally sufficient adjustment set can potentially help in arriving at an unbiased estimate of the effect, but it doesn't necessarily help you arrive at an estimate with the lowest error (i.e., distance from the truth). In addition, the DAG tells you which variables to control for, but not how to control for them; you can still have a biased effect estimate if you incorrectly model the relationships among the adjustment variables, treatment, and outcome.
In essence, DAGs tell you about identification, not about estimation performance. There may be estimators that rely on adjustment sets that are not sufficient to remove confounding but have lower error than estimators that rely on sufficient adjustment sets. A few examples come to mind: conditioning on many near-instruments that are actually confounders will increase the variance of the effect estimate, while not conditioning on them, i.e., only conditioning on strong predictors of the outcome, might yield an estimate with lower error because it has lower variance, even if it's a little biased. In the case of butterfly bias, there may be no sufficient adjustment set, and yet one can arrive at a reasonably low-error estimate by conditioning on the collider (even though doing so induces bias). There are so many possibilities that it's impossible to enumerate them, and no DAG can distinguish among them.
There are a few considerations that have been agreed upon:
- Conditioning on confounders reduces bias
- Conditioning on instruments and near-instruments may not affect bias, but will increase variance (and in the presence of unobserved confounding, will amplify the existing bias)
- Conditioning on strong predictors of the outcome, even if they are not confounders, decreases variance without affecting bias (the simulation sketch after this list illustrates these variance effects).
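To make the variance claims concrete, here is a minimal simulation sketch, assuming hypothetical variables: z is an instrument, c1 a weak confounder, and p a strong outcome predictor that is not a confounder. Both models adjust for c1 and are therefore unbiased; they differ only in which extra variable they condition on.

```r
set.seed(42)
reps <- 500
n <- 1000
est_iv <- est_pred <- numeric(reps)
for (r in seq_len(reps)) {
  z  <- rnorm(n)
  c1 <- rnorm(n)
  p  <- rnorm(n)
  a  <- rbinom(n, 1, plogis(z + c1))           # treatment
  y  <- 0.5 * a + 0.2 * c1 + 2 * p + rnorm(n)  # true effect = 0.5
  est_iv[r]   <- coef(lm(y ~ a + c1 + z))["a"]  # confounder + instrument
  est_pred[r] <- coef(lm(y ~ a + c1 + p))["a"]  # confounder + outcome predictor
}
# Conditioning on the instrument inflates the variance of the effect
# estimate; conditioning on the outcome predictor shrinks it
c(sd_with_instrument = sd(est_iv), sd_with_predictor = sd(est_pred))
```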
A minimally sufficient adjustment set might have many near-instruments and will definitely not include predictors of the outcome that are not confounders, making it potentially a sub-optimal set to condition on. On the other hand, conditioning on several strong predictors of the outcome, even if they are not confounders, but missing some true confounders that have weak effects on the outcome, may yield an estimate that has low error due to its low variance, even if some bias is allowed in.
There are a variety of variable selection and modeling techniques available for estimating treatment effects that take into account the relative performance of different types of variables. See my answer here for some examples of contemporary methods.
If you get two estimates resulting from conditioning on different minimally sufficient adjustment sets, there are many possibilities that cannot be distinguished from each other. The first is that both estimates are unbiased, but one or both of them has high variance, so one or both of them have high error. You can check to see which one you'd expect to have lower error based on the considerations mentioned above, and favor that one. It will probably have a narrower confidence interval. The second is that bias is induced by failing to correctly control for the variables in the adjustment set. Here, you may have successfully chosen the correct variables to control for, but you incorrectly modeled the relationships among the adjustment set, the treatment, and the outcome. You can avoid this by using a technique that tries to get around this problem by using a flexible model. The third is that the DAG is wrong in some way, and one or both of the adjustment sets is incorrect. Here, it may still be possible that the incorrect adjustment set yields an estimate with lower error; it's impossible to know. The variety of adjustment sets act as sensitivity analyses for each other, and if they all yield different conclusions, then there is not enough information available to you to make a substantive conclusion (i.e., analogous to not having a large enough sample to distinguish between a true effect and sampling error).
To summarize: Minimally sufficient adjustment sets (and sufficient adjustment sets in general) don't tell you which estimates will have the lowest error; they only tell you which variables are required for nonparametric identification of a causal effect. Estimating an effect with low error requires many different considerations, some of which are apparent from a DAG and others which are not. Knowing a sufficient adjustment set doesn't mean you know how to control for the variables to arrive at an unbiased effect estimate. Even unbiased effect estimates may have high error due to having high variance. There are a variety of reasons why estimates might differ from each other, and no good ways to adjudicate among them.
Multiple Regression – Why Perform Matching for Causal Inference Instead of Regressing on Confounders
As I see it, there are two related reasons to consider matching instead of regression. The first is assumptions about functional form, and the second is about proving to your audience that functional form assumptions do not affect the resulting effect estimate. The first is a statistical matter and the second is epistemic. Consider the tale below that attempts to illustrate how the choice between matching and regression could play out.
We'll assume you have measured a sufficient adjustment set to satisfy the backdoor criterion (i.e., all relevant confounders have been measured) with no measurement error or missing data, and that your goal is to estimate the marginal treatment effect of the treatment on an outcome. We'll also assume the standard assumptions of positivity and SUTVA hold. We'll consider a continuous outcome first, but much of the discussion extends to general outcomes.
Part 1: Regression
You decide to run a regression of the outcome on the treatment and confounders as a way to control for confounding by these variables because that is what linear regression is supposed to do. However, the effect estimate is only unbiased under extremely strict circumstances. First, that the treatment effect is constant across levels of the confounders, and second, that the linear model describes the conditional relationship between the outcome and the confounders. For the first, you might include an interaction between the treatment and each confounder, allowing for heterogeneous treatment effects while estimating the marginal effect. This is equivalent to g-computation (1), which involves using the fitted regression model to generate predicted values under treatment and control for all units and using the difference in the means of these predicted values as the effect estimate.
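A minimal sketch of this g-computation procedure, assuming a hypothetical data frame d with outcome y, binary treatment a, and confounders x1 and x2:

```r
# g-computation: fit an outcome model with treatment-covariate interactions,
# then contrast mean predictions with treatment set to 1 vs. 0 for everyone
fit <- lm(y ~ a * (x1 + x2), data = d)

d1 <- transform(d, a = 1)  # counterfactual dataset: everyone treated
d0 <- transform(d, a = 0)  # counterfactual dataset: everyone untreated
ate_hat <- mean(predict(fit, newdata = d1)) - mean(predict(fit, newdata = d0))
```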
That still assumes a linear model for the outcomes under treatment and control. Okay, we'll use a flexible machine-learning method like random forests instead. Well, now we can't claim our estimator is unbiased, only possibly consistent, and it still requires the specific machine learning model to approach the truth at a certain rate. Okay, we'll use Superlearner (2), a stacking method that takes on the rate of convergence of the fastest of its included models. Well, now we don't have a way to conduct inference, and the model might still be wrong. Okay, we'll use a semiparametric efficient doubly robust estimator like augmented inverse probability weighting (AIPW) (3) or targeted minimum loss-based estimation (TMLE) (4). Well, the validity of those estimators requires the nuisance-function estimators to fall in a Donsker class. Okay, we'll use cross-fitting with AIPW or TMLE to relax that requirement (5).
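For reference, a sketch of the standard AIPW estimator of the average treatment effect, where $\hat m_a(X)$ is the fitted outcome model under treatment level $a$ and $\hat e(X)$ is the fitted propensity score:

$$
\hat\psi_{\mathrm{AIPW}} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat m_1(X_i) - \hat m_0(X_i) + \frac{A_i\left(Y_i - \hat m_1(X_i)\right)}{\hat e(X_i)} - \frac{(1-A_i)\left(Y_i - \hat m_0(X_i)\right)}{1-\hat e(X_i)}\right].
$$

It is consistent if either the outcome models or the propensity score model is correctly specified, which is the sense in which it is doubly robust.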
Great. You've taken regression to its extreme, relaxing as many assumptions as possible and landing on a multiply robust estimator (multiply robust in the sense that if any one of several models is correct, the estimator is consistent) with generally good inference properties (and it can be bootstrapped, so getting the analytic variance exactly right isn't a big problem). Have we solved causal inference?
You submit the results of your cross-fit TMLE estimate using Superlearner for the propensity score and potential outcome models with a full library including highly adaptive lasso and many other models, which, under weak assumptions, are all that are required for a truly consistent estimator that converges at a parametric rate.
A reviewer reads the paper and says, "I don't believe the results of this model."
"Why not?" you say. "I used the optimal estimator with the best properties; it is consistent and semiparametric efficient with few, if any, assumptions on the functional forms of the models."
"Your estimator is consistent," says the reviewer, "but not unbiased. That means I can only trust its results in general and as N goes to infinity. How do I know you have successfully eliminated bias in the effect estimate in this dataset?"
"..."
Part 2: Matching to the Rescue
You read about a hot new method called "propensity score matching" (6). It was big in 1983, and, even in 2021, you see it in almost every paper published in specialized medical journals. You come across King and Nielsen's influential paper "Why Propensity Scores Should Not Be Used for Matching" (7) and Noah's answer on CV describing the many drawbacks to using propensity score matching. Okay, you'll use genetic matching instead (8), and minimize the energy distance between the samples (9), including a flexibly estimated propensity score as a covariate to match on. You find that balance can be improved by using substantive knowledge to incorporate exact matching and caliper constraints that prioritize balance on covariates known to be important to the outcome. You decide to use full matching to relax the requirement of 1:1 matching to include more units in the analysis (10).
You estimate the treatment effect using a simple linear regression of the outcome on the treatment and the covariates, including the matching weights in the regression and using a cluster-robust standard error to account for pair membership (11). You resubmit the results of your full matching analysis using exact matching and calipers for prognostically important variables and a distance matrix estimated using genetic matching on the covariates and a flexibly estimated propensity score.
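A hedged sketch of such a pipeline, simplified to full matching on a GLM propensity score (the data frame d and variables a, y, x1, x2 are hypothetical; method = "full" requires the optmatch package):

```r
library(MatchIt)

m.out <- matchit(a ~ x1 + x2, data = d, method = "full", distance = "glm")
md    <- match.data(m.out)  # adds matching weights and subclass (matched-set) IDs

# Weighted outcome regression with treatment-covariate interactions
fit <- lm(y ~ a * (x1 + x2), data = md, weights = weights)

# Cluster-robust inference accounting for matched-set membership
lmtest::coeftest(fit, vcov. = sandwich::vcovCL, cluster = ~subclass)
```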
The reviewer reads your new manuscript. "Wow, you've learned a lot. But I still don't believe you've removed bias in the effect estimate."
"Look at the balance tables," you say. "The covariate distributions are almost identical."
"I see low standardized mean differences," says the reviewer, "but imbalances could remain on other features of the covariate distribution."
"Look at the balance tables in the appendix which contain balance statistics for pairwise interactions, polynomials up to the 5th power of each covariate, and Kolmogorov-Smirnov statistics to compare the full covariate distributions. There are no meaningful differences between the samples, and no differences at all on the most highly prognostic covariates because of the exact matching constraints and calipers."
"I see..."
"Also, I used Branson's randomization test (12) with the energy distance as the balance statistic to show that my sample is better balanced not only than a hypothetical randomized trial using the same data, but also a block randomized trial, and even a covariate balance-constrained randomized trial."
"Wow, I guess I don't have much to say..."
"My outcome regression estimator isn't just consistent, it's truly unbiased in this sample. Also, because I incorporated pair membership into the analysis, my standard errors are smaller and more accurate and the resulting estimate is less sensitive to unobserved confounding* (13)."
"I get it!"
Part 3: The Criticism
Frank Harrell bursts into the room. "Wait, by discarding so many units in matching, you have thrown away so much useful data and needlessly decimated your precision." Mark van der Laan follows. "Wait, by using substantive 'expertise' you are not letting the analysis method find the true patterns in the data that might have eluded researchers, and your estimator does not converge at a known rate, let alone a parametric one! And there is no guarantee that your inference is valid!" I, your humble narrator, too, join in on the dogpile. "Wait, by using exact matching constraints and calipers, you have shifted your estimand away from the ATE or any a priori describable estimand (14)! Your effect estimate may be unbiased, but unbiased for what?"
You stand there, bewildered, defeated, feeling like you have come nowhere since you asked your simple question on CrossValidated what felt like years ago, no closer to understanding whether you should use matching or regression to estimate causal effects.
The curtains close.
Part 4: Epilogue
In the face of uncertainty and scarcity, we are left with tradeoffs. The choice between a regression-based method and matching to estimate a causal effect depends on how you and your audience choose to manage those tradeoffs and prioritize the advantages and drawbacks of each method.
Standard regression requires strong functional form assumptions, but with advanced methods, those can be relaxed, at the cost of giving up on bias and focusing on consistency and asymptotic inference. Many of these advanced methods work best in large samples, and they still require many choices along the way (e.g., which specific estimator to use, which machine learning methods to include in the Superlearner library, how many folds to use for cross-validation and cross-fitting, etc.). Although the multiply robust methods may guarantee consistency and fast convergence rates in general, it is not immediately clear how to assess how well they eliminated bias in your particular dataset, potentially leaving one skeptical of their actual performance in your one instance.
Matching methods require few functional form assumptions because no models are required (e.g., when using a distance matrix that doesn't depend solely on the propensity score, like that resulting from genetic matching). You can control confounding by adjusting the specification of the match, focusing more effort on hard-to-balance or prognostically important variables. You can come close to guaranteeing unbiasedness by ensuring you have achieved covariate balance, which can and should be measured extremely broadly with a skeptic in mind. You can use tools for analyzing randomized trials and trials with more powerful and robust designs. This comes at the cost of possibly decimating your precision by discarding huge amounts of data, changing your estimand so that your effect estimate doesn't generalize to a meaningful population and isn't replicable, and relying on ad hoc, "artisanal" methods with no clear path for valid inference.
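As a sketch, a broad balance assessment of this kind with cobalt, applied to the hypothetical matchit() output m.out from above:

```r
library(cobalt)

# Balance on means and Kolmogorov-Smirnov statistics, including pairwise
# interactions and polynomial terms, before ("un") and after matching
bal.tab(m.out, stats = c("mean.diffs", "ks.statistics"),
        int = TRUE, poly = 3, un = TRUE)

# Visual summary of standardized mean differences against a 0.1 threshold
love.plot(m.out, stats = "mean.diffs", abs = TRUE, thresholds = c(m = .1))
```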
The advantage matching has over regression, and the reason why I think it is so valuable and why I devoted my graduate training to understanding and improving matching and its use by applied researchers as the author of the R packages cobalt, WeightIt, MatchIt, and others, is an epistemic advantage. With matching, you can more effectively convince a reader that what you have done is trustworthy and that you have accounted for all possible objections to the observed result, and can at least point to specific assumptions and explain how their violation might affect results. This all centers on covariate balance, the similarity between covariate distributions across the treatment groups. By reporting balance broadly and submitting the resulting matched data to a battery of tests and balance measures, you can convince yourself and your readers that the resulting effect estimate is unbiased and therefore trustworthy (given the assumptions mentioned at the beginning, though these may be tenuous, and neither matching nor regression can solve that problem).
However, not everyone agrees that this advantage is so important, or more important than consistency and valid asymptotic inference. There can never be consensus on this matter, because consensus requires knowing the truth, and science (including statistics research) is about searching for an inherently unknowable truth (i.e., the true parameters that govern or describe our world). That is, if we knew the true causal effect, we could know the best method to estimate it, but we don't, so we can't. We can only do our best using the knowledge we have and try to manage the inherent constraints and tradeoffs as well as we can as we fumble around in the dark using the pinpoint of light the universe has shown us.
*Only when using a special method of inference for matched samples.
1. Snowden JM, Rose S, Mortimer KM. Implementation of G-Computation on a Simulated Data Set: Demonstration of a Causal Inference Technique. American Journal of Epidemiology. 2011;173(7):731–738.
2. van der Laan MJ, Polley EC, Hubbard AE. Super Learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(1). https://www.degruyter.com/view/j/sagmb.2007.6.issue-1/sagmb.2007.6.1.1309/sagmb.2007.6.1.1309.xml (Accessed October 8, 2019)
3. Daniel RM. Double Robustness. In: Wiley StatsRef: Statistics Reference Online. Wiley; 2018:1–14. http://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat08068 (Accessed November 9, 2018)
4. Gruber S, van der Laan MJ. Targeted Maximum Likelihood Estimation: A Gentle Introduction. 2009.
5. Zivich PN, Breskin A. Machine Learning for Causal Inference: On the Use of Cross-fit Estimators. Epidemiology. 2021;32(3):393–401.
6. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.
7. King G, Nielsen R. Why Propensity Scores Should Not Be Used for Matching. Political Analysis. 2019;1–20.
8. Diamond A, Sekhon JS. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics. 2013;95(3):932–945.
9. Huling JD, Mak S. Energy Balancing of Covariate Distributions. arXiv:2004.13962 [stat]. 2020. http://arxiv.org/abs/2004.13962 (Accessed December 22, 2020)
10. Stuart EA, Green KM. Using full matching to estimate causal effects in nonexperimental studies: Examining the relationship between adolescent marijuana use and adult outcomes. Developmental Psychology. 2008;44(2):395–406.
11. Abadie A, Spiess J. Robust Post-Matching Inference. Journal of the American Statistical Association. 2020:1–37.
12. Branson Z. Randomization Tests to Assess Covariate Balance When Designing and Analyzing Matched Datasets. Observational Studies. 2021;7:44–80.
13. Zubizarreta JR, Paredes RD, Rosenbaum PR. Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. The Annals of Applied Statistics. 2014;8(1):204–231.
14. Greifer N, Stuart EA. Choosing the Estimand When Matching or Weighting in Observational Studies. arXiv:2106.10577 [stat]. 2021. http://arxiv.org/abs/2106.10577 (Accessed September 17, 2021)
Best Answer
Yes, if your DAG is correct and age/sex is the minimal adjustment set, you can include those variables in your logistic regression as controls. BUT:
Without knowing your research problem, it's impossible to judge the plausibility of your DAG, but generally, observational studies that include only demographic 'covariates of convenience' are unlikely to provide credible causal estimates. Does your theory really warrant a DAG that has only age and sex as confounders? Or are there likely to be others that you haven't considered?
With logistic regression, controlling for other factors changes the estimand from a marginal odds ratio to a conditional odds ratio, which is often not desirable: the marginal odds ratio is more commonly what you're after, and because the odds ratio is non-collapsible, conditional odds ratios can be (much) further from 1 than the marginal odds ratio. Usually I would suggest back-transforming to average marginal effects in percentage points, or you could consider using IPTW (instead of regression control) to estimate the marginal odds ratio with logistic regression.
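A hedged sketch of both options, assuming a hypothetical data frame d with binary outcome and exposure (avg_comparisons() is from the marginaleffects package; weightit() is from WeightIt):

```r
library(marginaleffects)
library(WeightIt)

# Conditional model: adjust for the minimally sufficient set (age, sex)
fit <- glm(outcome ~ exposure + age + sex, data = d, family = binomial)
exp(coef(fit)["exposure"])    # conditional odds ratio

# Option 1: back-transform to an average marginal effect (a risk
# difference, computed by g-computation over the sample)
avg_comparisons(fit, variables = "exposure")

# Option 2: IPTW on age and sex, then an unadjusted logistic model on the
# weighted sample to target the marginal odds ratio
w <- weightit(exposure ~ age + sex, data = d, method = "glm", estimand = "ATE")
fit_w <- glm(outcome ~ exposure, data = d, family = quasibinomial,
             weights = w$weights)
exp(coef(fit_w)["exposure"])  # marginal odds ratio (pair with robust SEs)
```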