Solved – Logistic regression to adjust for confounders in treatment effect estimation: when is the model satisfactory?

logistic, machine-learning, regression, statistical-significance, treatment-effect

I am trying to measure the effect of a treatment on a binary outcome using observational data. However, the group that was treated and the group that was not treated are not equivalent: assignment to the treatment or control group is not randomized and probably depends on many different variables.

I want to perform regression adjustment to estimate the effect of treatment while taking into account confounding variables, in order to have an idea of the "true" effect of my treatment. Essentially, I want to fit a logistic regression model where the dichotomous outcome is explained by treatment and other confounding variables. Then I want to look at the coefficient of treatment to estimate treatment effect.
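For illustration, here is a minimal, self-contained sketch of what I mean, in Python with statsmodels on synthetic data; the variable names (`treated`, `age`, `severity`, `outcome`) are made up for the example, not from my actual data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, made-up data: `age` and `severity` stand in for confounders,
# `treated` for the non-randomized treatment, `outcome` for the result.
rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 10, n)
severity = rng.normal(0, 1, n)
# Treatment assignment depends on the confounders (non-randomized).
p_treat = 1 / (1 + np.exp(-(0.03 * (age - 50) + 0.8 * severity)))
treated = rng.binomial(1, p_treat)
# Outcome depends on treatment and confounders; true log-odds ratio = 0.5.
p_out = 1 / (1 + np.exp(-(-1 + 0.5 * treated + 0.02 * (age - 50) + 0.6 * severity)))
df = pd.DataFrame({"outcome": rng.binomial(1, p_out),
                   "treated": treated, "age": age, "severity": severity})

# Regression adjustment: outcome explained by treatment plus confounders.
fit = smf.logit("outcome ~ treated + age + severity", data=df).fit()
print(fit.summary())
# With `treated` coded 0/1, its coefficient is the conditional log-odds
# ratio of the treatment, adjusted for the covariates in the model.
print("adjusted log-odds ratio:", fit.params["treated"])
```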

My main question is: how can I know whether my model is satisfactory? I don't know what the "true" confounding variables are, and I have access to a large volume of data and predictors. However, I don't know when I can safely say that my model has effectively corrected for the structural bias in the data. I'm thinking about looking at the pseudo-R² in the logistic regression output, but I'm not sure what a "good" value would be. I'm also wondering whether there are other methods to assess that my model correctly estimates the treatment effect.

Best Answer

In general, there may be any number of confounders, and they may be known or unknown. Nothing inside your data will tell you whether some key confounders are unknown to you; that requires subject-matter knowledge. There are a number of methods for adjusting for potential confounders, including covariate adjustment (which you seem to be considering) and propensity score methods, e.g. matching observations by propensity score or stratifying by propensity score (using the propensity score as a covariate in the outcome model is rather problematic).
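As a minimal sketch of the propensity score idea, reusing the synthetic `df` (and `pandas` import) from the question above: model each unit's probability of receiving treatment from the measured covariates, then stratify on the score:

```python
from sklearn.linear_model import LogisticRegression

# Propensity score: modeled probability of receiving treatment given the
# measured covariates (reusing the synthetic `df` from the question above).
X = df[["age", "severity"]]
ps_model = LogisticRegression(max_iter=1000).fit(X, df["treated"])
df["pscore"] = ps_model.predict_proba(X)[:, 1]

# Stratification: compare outcomes within propensity-score quintiles
# instead of entering the score as a covariate in the outcome model.
df["stratum"] = pd.qcut(df["pscore"], q=5, labels=False)
print(df.groupby(["stratum", "treated"])["outcome"].mean())
```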

Assuming you think you have information on all the most important potential confounders, you may still face the problem of not having enough data to make it clear which of them actually play a role and in what form (for continuous ones, you may of course have any kind of non-linear effect). Criteria such as p > 0.05 (or 0.15), comparing AICs between models, or a pseudo-R² based on a limited sample size are in no way appropriate for justifying that you do not need to adjust for a potential confounder. In addition, data-driven model building causes severe issues with inference (if you go for covariate adjustment), and you would have to do a lot of complicated adjustments (e.g. bootstrapping the whole model-building process) to make sure the inference you obtain for the effect of interest is in any way meaningful.

Propensity score matching is apparently a lot more robust to overfitting of the propensity score model (you could use all kinds of spline terms there), and may be of interest for that reason. It also has the nice feature that you can check whether the covariates are reasonably balanced across matched pairs or strata.
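A similarly hedged sketch of 1:1 matching on the propensity score, followed by the balance check just mentioned, via standardized mean differences; the 0.1 cutoff below is a common rule of thumb, not a strict criterion:

```python
from sklearn.neighbors import NearestNeighbors

# 1:1 nearest-neighbour matching on the propensity score, with replacement
# for simplicity (again reusing `df` and `pscore` from the sketches above).
t = df[df["treated"] == 1]
c = df[df["treated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(c[["pscore"]])
_, idx = nn.kneighbors(t[["pscore"]])
matched = c.iloc[idx.ravel()]

# Balance check: standardized mean difference per covariate; values below
# roughly 0.1 are often read as acceptable balance (a rule of thumb).
for cov in ["age", "severity"]:
    smd = (t[cov].mean() - matched[cov].mean()) / np.sqrt(
        (t[cov].var() + matched[cov].var()) / 2)
    print(f"{cov}: SMD = {smd:.3f}")
```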

Observational data analysis is one of the more challenging statistical tasks and there are a number of good books (e.g. the one by Rosenbaum) and key articles (e.g. by Rubin and/or Rosenbaum) that I would recommend.
