Regression – Assumptions of Logistic Regression for Causal Inference Explained

causality, econometrics, endogeneity, logistic, regression

I'm trying to understand what assumptions logistic regression needs when you intend to interpret a parameter as causal. The assumptions for causal OLS regressions are well known, but I can't find a good source for the corresponding assumptions for logistic regressions.

From what I can find on the internet, I think the following assumptions need to hold:

  1. Errors are distributed according to a logistic distribution and are independent of each other
  2. No multicollinearity

My intuition tells me that the independent variables should not be correlated with the error term (no endogeneity), as in the case of OLS regressions, but I can't find support for this anywhere. Does anyone have a mathematical argument for this? That is, where would estimation go wrong?

  • On the same point, suppose you're interested in the coefficient on X1 as the causal parameter, and X1 is not correlated with the error term but X2 is. You're not interested in the coefficient on X2 in a causal sense. Can you still run this logistic regression and interpret the coefficient on X1 as causal? That is, would the endogeneity of X2 distort the estimate of the coefficient on X1?

I also read that the errors are not identically distributed, but I'm not sure why. Can anyone explain why this is true?

Are there any other assumptions for logistic regression when you want to use it for causal inference?

Best Answer

The capacity to interpret regression relationships as causal generally depends on experimental protocols rather than the assumed structure of the statistical model. Regression models allow us to relate the explanatory variables statistically to the response variable, where this relationship is made conditional on all the explanatory variables in the model. As a default position, that is still just a predictive relationship, and should not be interpreted causally. That is the case in standard linear regression using OLS estimation, and it is also true in logistic regression.

Suppose we want to interpret a regression relationship causally, e.g., we have an explanatory variable $x_k$ and we want to interpret its regression relationship with the response variable $Y$ as a causal relationship (the former causing the latter). The danger here is that the predictive relationship might actually be due to a confounding factor: an additional variable outside the regression that is statistically related to $x_k$ and is the real cause of $Y$. If such a confounding factor exists, it will induce a statistical relationship between $x_k$ and $Y$ that we will see in our regression. (The other mistake you can make is to condition on a mediator variable, which also leads to an incorrect causal inference.)
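To make the confounding mechanism concrete, here is a rough sketch in notation that is assumed for illustration only (the confounder $z$ and the coefficients $\beta_0$, $\beta_k$, $\gamma$ do not appear in the question). Suppose the response really follows a logistic model that includes the confounder:

$$
\Pr(Y = 1 \mid x_k, z) = \Lambda(\beta_0 + \beta_k x_k + \gamma z), \qquad \Lambda(t) = \frac{1}{1 + e^{-t}}.
$$

If $z$ is left out of the regression, the data can only tell us about the relationship averaged over the confounder:

$$
\Pr(Y = 1 \mid x_k) = \mathbb{E}\big[\Lambda(\beta_0 + \beta_k x_k + \gamma Z) \,\big|\, x_k\big],
$$

where the expectation is taken over the conditional distribution of $Z$ given $x_k$. When $Z$ is statistically related to $x_k$, that conditional distribution shifts as $x_k$ changes, so the fitted relationship mixes the contribution of $x_k$ with the contribution of $Z$. In particular, even if $\beta_k = 0$ (no causal effect at all), the right-hand side will generally still vary with $x_k$ whenever $\gamma \neq 0$.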

So, in order to interpret regression relationships causally, we want to be confident that what we are seeing is not the result of confounding factors outside our analysis. The best way to ensure this is to use controlled experimentation to set $x_k$ via randomisation/blinding, thereby severing any statistical link between this explanatory variable and any would-be confounding factor. In the absence of this, the next best thing is to use uncontrolled analysis, but try to bring in as many possible confounding factors as we can, to filter them out in the regression. (No guarantees that we have found them all!) There are also other methods, such as using instrumental variables, but these generally hinge on strong assumptions about the nature of those variables.
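The difference between these strategies can be illustrated with a small simulation. This is only a sketch under assumed settings: the data-generating process, the coefficient values, and all function names (`simulate`, `fit_logit`, `coef_on_x`) are made up for illustration, and the logistic fit is a plain Newton-Raphson routine written with the standard library rather than any particular package. In the simulated world, $x$ has no causal effect on $Y$ at all; a confounder $z$ drives $Y$ and, in the observational setting, also drives $x$.

```python
import math
import random

def expit(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_logit(X, y, max_iter=25, tol=1e-8):
    """Maximum-likelihood logistic regression via Newton-Raphson.
    Each row of X must start with 1.0 for the intercept."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            p = expit(sum(b * v for b, v in zip(beta, row)))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * row[a]
                for c in range(k):
                    hess[a][c] += w * row[a] * row[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def simulate(n, randomised, seed):
    """Draw (x, z, y) where z drives y and x has NO causal effect on y."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                      # confounder
        p_x = 0.5 if randomised else expit(1.5 * z)  # coin flip vs. z-driven
        x = 1.0 if rng.random() < p_x else 0.0
        y = 1.0 if rng.random() < expit(-0.5 + 1.5 * z) else 0.0
        data.append((x, z, y))
    return data

def coef_on_x(data, include_z):
    """Fit the logistic regression and return the coefficient on x."""
    X = [[1.0, x, z] if include_z else [1.0, x] for x, z, _ in data]
    y = [yi for _, _, yi in data]
    return fit_logit(X, y)[1]

obs = simulate(10000, randomised=False, seed=1)  # uncontrolled data
rct = simulate(10000, randomised=True, seed=2)   # x assigned by coin flip

naive = coef_on_x(obs, include_z=False)       # confounder left out
adjusted = coef_on_x(obs, include_z=True)     # confounder brought in
randomised = coef_on_x(rct, include_z=False)  # confounder left out, x randomised

print(f"observational, z omitted : {naive:.2f}")
print(f"observational, z included: {adjusted:.2f}")
print(f"randomised, z omitted    : {randomised:.2f}")
```

With these assumed settings, the coefficient on $x$ should come out clearly positive when $z$ is omitted from the observational data, and close to zero both when $z$ is included and when $x$ is randomised. Note that the adjusted fit only works here because the simulation knows exactly what the confounder is; in real uncontrolled data there is no such guarantee.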

None of the assumptions you mention are necessary or sufficient to infer causality. Those are just model assumptions for the logistic regression, and if they do not hold you can vary your model accordingly. The main assumption you need for causal inference is to assume that confounding factors are absent. That can be done by using a randomisation/blinding protocol in your experiment, or it can be left as a (hope-and-pray) assumption.
