Logistic Regression – Why Doesn't Logistic Regression Require Homoscedasticity and Normality of Residuals, or a Linear Relationship?

assumptions, linear model, logistic, multiple regression

I was reading this link when I got stuck trying to understand this. Neither Wooldridge's Introductory Econometrics nor O'Reilly's Data Science from Scratch explores this question, and I was surprised I couldn't find any explanation for it. So, the problem concerns Logistic Regression assumptions: why doesn't Logistic Regression require the error and linear relationship assumptions that Linear Regression requires?

I will try to explain further, but if the question gets messy, the title is the short version and the thing stuck in my head. So, I know that Logistic Regression is about categorical targets, but the regression actually predicts the probability of an event/category, right? Isn't that something that would require linear relationships?

Regarding the errors, is the normality assumption not required because the errors will be zero or one? I thought some assumption would be required so we don't get any bias (e.g., we have a logit model to predict whether someone will pay their debt, but our model gets most predictions right for people from NYC and not for people from NJ).

Well, I think my question got a little messy because I tried to explain too much, but hopefully people will understand, and the assumptions will be explored more deeply here than in most tutorials.

Thanks in advance

Best Answer

Isn't that something that would require linear relationships?

The assumption is that the effect of covariates is linear on the log odds scale. You might see logistic regression written as

$$ \operatorname{logit}(p) = X \beta $$

Here, $\operatorname{logit}(p) = \log\left( \frac{p}{1-p} \right)$. Additionally, remember that in a GLM, linearity means linearity on the link scale, not a straight-line relationship between the covariates and the outcome.
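
To make this concrete, here is a minimal Python sketch (the coefficients $\beta_0 = -1$ and $\beta_1 = 2$ are assumed for illustration, not taken from the question): equal steps in $x$ give equal steps in $\operatorname{logit}(p)$, while $p$ itself traces an S-curve bounded in $(0, 1)$.

```python
import numpy as np

beta0, beta1 = -1.0, 2.0            # assumed coefficients, for illustration only
x = np.linspace(-4, 4, 9)

log_odds = beta0 + beta1 * x        # exactly linear in x
p = 1 / (1 + np.exp(-log_odds))     # inverse logit squashes into (0, 1)

for xi, li, pi in zip(x, log_odds, p):
    print(f"x = {xi:+.1f}   logit(p) = {li:+.1f}   p = {pi:.3f}")
```

Each one-unit step in $x$ always adds $\beta_1$ to the log odds, but the corresponding change in $p$ shrinks near 0 and 1; that is the sense in which the model is linear.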

Regarding the errors, the normality assumption isn't required because the errors will be zero or 1?

Not quite. Logistic regression estimates a probability, so the error (meaning observation minus prediction) will be between $-1$ and $1$: when $y = 1$ the error is $1 - \hat{p}$, and when $y = 0$ it is $-\hat{p}$.
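
A self-contained sketch (simulated data under the same assumed coefficients as above) makes the two residual bands visible; nothing about them resembles a normal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
p = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))   # assumed true probabilities
y = rng.binomial(1, p)                    # Bernoulli outcomes

resid = y - p                             # observation minus prediction
print(resid.min(), resid.max())           # strictly inside (-1, 1)
print((resid[y == 1] > 0).all(), (resid[y == 0] < 0).all())  # two disjoint bands
```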

Why doesn't Logistic Regression require the error and linear relationship assumptions that Linear Regression requires?

Logistic regression is still a linear model; it is just linear in a different space, so as to respect the constraint that $0 \leq p \leq 1$. As for your titular question regarding the error term and its variance, note that a binomial random variable's variance depends on its mean ($\operatorname{Var}(Y) = np(1-p)$). Hence, the variance changes as the mean changes, meaning the variance is (technically) heteroskedastic (i.e., non-constant, or at the very least changes with $X$, because $p$ changes with $X$).
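
A minimal sketch under the same assumed model ($\operatorname{logit}(p) = -1 + 2x$) shows this built-in heteroskedasticity: the empirical variance of the outcome differs across slices of $x$ because $p(1-p)$ does.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
p = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))   # assumed model: logit(p) = -1 + 2x
y = rng.binomial(1, p)

# Var(y | x) = p(1 - p) moves with x, so var(y) is far from constant:
for lo, hi in [(-3.0, -1.0), (-1.0, 0.0), (0.0, 1.0), (1.0, 3.0)]:
    mask = (x >= lo) & (x < hi)
    print(f"x in [{lo:+.0f}, {hi:+.0f}): var(y) = {y[mask].var():.3f}")
```

The variance peaks where $p \approx 0.5$ and shrinks toward the tails, so a constant-variance assumption could not even be stated for this model.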
