Logistic Regression – Why Doesn’t Logistic Regression Require Homoscedasticity and Normality of Residuals, Nor a Linear Relationship?

assumptions, linear model, logistic, multiple regression

I was reading this link when I got stuck. Neither Wooldridge's Introductory Econometrics nor O'Reilly's Data Science from Scratch explores this question, and I was surprised that I couldn't find an explanation anywhere. The problem is related to the assumptions of Logistic Regression. Why doesn't Logistic Regression require the error and linear relationship assumptions that Linear Regression requires?

I will try to explain better, but if the question gets messy, the title is the short version and the thing that got stuck in my head… So, I know that Logistic Regression is for categorical targets, but the regression actually predicts the probability of an event/category, right? Isn't that something that would require linear relationships?

Regarding the errors, the normality assumption isn't required because the errors will be zero or 1? I thought some assumption would be required so that we don't get any bias (e.g., we have a logit model to predict whether someone will pay a debt, and it gets most predictions right for people from NYC but not for people from NJ, idk).

Well, I think my question got a little messy because I tried to explain better, but hopefully, people will understand and the assumptions will be more explored than most tutorials we have.

Thanks in advance

Best Answer

> Isn't that something that would require linear relationships?

The assumption is that the effect of covariates is linear on the log odds scale. You might see logistic regression written as

$$ \operatorname{logit}(p) = X \beta $$

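Here, $\operatorname{logit}(p) = \log\left( \frac{p}{1-p} \right)$. Additionally, remember that in a GLM "linear" means linear in the coefficients $\beta$, not that the fitted relationship must be a straight line: the columns of $X$ can include transformed covariates such as polynomial or spline terms.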
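To make "linear on the log odds scale" concrete, here is a minimal R sketch. The values of the linear predictor `eta` are made up for illustration; `plogis` and `qlogis` are base R's inverse logit and logit functions.

```r
# Linear predictor values spaced one unit apart (hypothetical values)
eta <- seq(-3, 3, by = 1)

# The inverse logit maps the linear predictor to a probability in (0, 1)
p <- plogis(eta)

# The logit maps the probability back to the log odds
log_odds <- qlogis(p)

# Equal steps in eta give a constant change in the log odds ...
diff(log_odds)
# ... but unequal changes in the probability itself
diff(p)
```

The step in probability is largest near $p = 0.5$ and shrinks toward the tails, which is why the relationship looks S-shaped on the probability scale even though it is linear on the log odds scale.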

> Regarding the errors, the normality assumption isn't required because the errors will be zero or 1?

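Not quite. Logistic regression estimates a probability $\hat{p}$, so the error (meaning observation minus prediction) lies strictly between $-1$ and $1$: it equals $1 - \hat{p}$ when the observed outcome is 1 and $-\hat{p}$ when it is 0.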
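A small simulation can show what these errors look like. This is only a sketch: the sample size, seed, and coefficients (`-0.5` and `1.2`) are arbitrary assumptions, not values from the question.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 500
x <- rnorm(n)
# Simulate a binary outcome whose log odds are linear in x (made-up coefficients)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)   # predicted probabilities
res <- y - p_hat       # observation minus prediction

# The residuals stay strictly inside (-1, 1)
range(res)
# Positive when y is 1 (equal to 1 - p_hat), negative when y is 0 (equal to -p_hat)
all(res[y == 1] > 0, res[y == 0] < 0)
# These are the same as R's "response" residuals
all.equal(unname(res), unname(residuals(fit, type = "response")))
```

Because each residual can take only one of two values for a given $\hat{p}$, a histogram of these residuals cannot look normal, which is one way to see why normality is not the relevant check here.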

> Why doesn't Logistic Regression require the error and linear relationship assumptions that Linear Regression requires?

Logistic regression is still a linear model; it is just linear in a different space so as to respect the constraint that $0 \leq p \leq 1$. As for your titular question regarding the error term and its variance, note that a binomial random variable's variance depends on its mean ($\operatorname{Var}(Y) = np(1-p)$, with $n = 1$ for a single binary outcome). Hence, the variance changes as the mean changes, meaning the variance is (technically) heteroskedastic (i.e. non-constant, or at the very least it changes with the covariates $X$, because $p$ changes with $X$).
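The dependence of the variance on the mean can be checked with a quick simulation. The probabilities and the number of draws below are hypothetical choices made for illustration.

```r
set.seed(2)  # arbitrary seed for reproducibility
p_values <- c(0.1, 0.3, 0.5, 0.9)  # hypothetical success probabilities
n_draws <- 1e5

# Empirical variance of Bernoulli draws at each probability
emp_var <- sapply(p_values, function(p) var(rbinom(n_draws, size = 1, prob = p)))
# Theoretical variance p(1 - p) of a single binary outcome
theo_var <- p_values * (1 - p_values)

round(cbind(p = p_values, empirical = emp_var, theoretical = theo_var), 3)
```

The variance peaks at $p = 0.5$ and falls toward 0 as $p$ approaches 0 or 1, so observations with different covariate values generally have different variances by construction.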

Related Solutions

Regression Assumptions – Why Test Normality of Overall Residuals Instead of Conditional on $\hat{y}$?

> Couldn't we have normal residuals at each predicted value of y, while having overall residuals that were quite non-normal?

No -- at least, not under the standard assumption that the variance of the errors is constant.

You can think of the distribution of overall residuals as a mixture of normal distributions (one for each level of $\hat{y}$). By assumption, all of these normal distributions have the same mean (0) and the same variance. Thus, the distribution of this mixture of normals is itself simply a normal distribution.

So from this we can form a little syllogism based on modus tollens: if P then Q; not Q; therefore not P. In this case we have: if the individual distributions given the values of the predictor X are normal (and their variances are equal), then the distribution of the overall residuals is normal. So if we observe that the distribution of overall residuals is apparently not normal, this implies that the distributions given X are not normal with equal variance, which is a violation of the standard assumptions.
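The mixture argument can be illustrated with a rough R sketch. It simulates errors directly rather than residuals from a fitted model, and the grouping variable, standard deviations, and the helper `excess_kurtosis` are all assumptions introduced for this example.

```r
set.seed(3)  # arbitrary seed for reproducibility
n <- 1e5
# Hypothetical grouping that drives the conditional error variance
group <- rbinom(n, size = 1, prob = 0.5)

# Case 1: conditionally normal errors with the same sd in both groups
e_equal <- rnorm(n, mean = 0, sd = 1)
# Case 2: conditionally normal errors, but the sd differs by group
e_unequal <- rnorm(n, mean = 0, sd = ifelse(group == 1, 3, 0.5))

# Sample excess kurtosis (illustrative helper); it is 0 for a normal distribution
excess_kurtosis <- function(e) {
  mean((e - mean(e))^4) / mean((e - mean(e))^2)^2 - 3
}

excess_kurtosis(e_equal)    # near 0: the pooled errors look normal
excess_kurtosis(e_unequal)  # clearly positive: the pooled errors are heavy-tailed
```

In the second case every conditional distribution is normal, yet the pooled distribution is not, because the equal-variance part of the assumption fails.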

@BigBendRegion points out something in the comments that I think is worth adding to this answer for emphasis. The line of argument I outlined above works for refuting normality, but it cannot be used to confirm normality. That is, if we check the marginal distribution of residuals and see that it does appear normal, this does NOT entail that the residuals conditional on X are normal (see HERE for counterexamples). In terms of the P and Q statements above, observing that Q is true does not entail that P is true. That would be affirming the consequent.

Linear and Generalized Linear Models – Why Linear Regression Has Assumptions on Residual but Generalized Linear Model Has Assumptions on Response

Simple linear regression having Gaussian errors is a very nice attribute that does not generalize to generalized linear models.

In generalized linear models, the response follows some given distribution given the mean. Linear regression follows this pattern; if we have

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

with $\epsilon_i \sim N(0, \sigma)$ (here $\sigma$ denotes the standard deviation)

then we also have

$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma)$
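One way to see that linear regression fits the GLM pattern is to fit the same simulated data with `lm` and with `glm` using the Gaussian family. This is a sketch under made-up settings: the coefficients, noise level, and seed are arbitrary.

```r
set.seed(4)  # arbitrary seed for reproducibility
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)  # hypothetical coefficients and sd

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian)  # the identity link is the default

# Both formulations give the same coefficient estimates
all.equal(coef(fit_lm), coef(fit_glm), tolerance = 1e-6)
```

Swapping `family = gaussian` for `family = binomial` (with a binary response) is what turns the same machinery into logistic regression.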

Okay, so the response follows the given distribution for generalized linear models, but for linear regression we also have that the residuals follow a Gaussian distribution. Why is it emphasized that the residuals are normal when that's not the generalized rule? Well, because it's the much more useful rule. The nice thing about thinking about normality of the residuals is that it is much easier to examine. If we subtract out the estimated means, all the residuals should have roughly the same variance and roughly the same mean (0) and will be roughly normally distributed (note: I say "roughly" because if we don't have perfect estimates of the regression parameters, which of course we do not, the estimates of $\epsilon_i$ will have different variances depending on the value of $x$. But hopefully there's enough precision in the estimates that this is ignorable!).

On the other hand, looking at the unadjusted $y_i$'s, we can't really tell if they are normal if they all have different means. For example, consider the following model:

$y_i = 0 + 2 \times x_i + \epsilon_i$

with $\epsilon_i \sim N(0, 0.2)$ and $x_i \sim \text{Bernoulli}(p = 0.5)$

Then the $y_i$ will be highly bimodal, but this does not violate the assumptions of linear regression! On the other hand, the residuals will follow a roughly normal distribution.

Here's some R code to illustrate.

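```r
set.seed(1)  # added for reproducibility; the seed value is arbitrary
# Binary predictor and a response with small Gaussian noise
x <- rbinom(1000, size = 1, prob = 0.5)
y <- 2 * x + rnorm(1000, sd = 0.2)
fit <- lm(y ~ x)
resids <- residuals(fit)
# Side-by-side histograms: bimodal responses, roughly normal residuals
par(mfrow = c(1, 2))
hist(y, main = 'Distribution of Responses')
hist(resids, main = 'Distribution of Residuals')
```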
