Regression – Where Does the Misconception that Y Must Be Normally Distributed Come From?

Tags: dependent variable, least squares, linear model, regression

Seemingly reputable sources claim that the dependent variable must be normally distributed:

Model assumptions: $Y$ is normally distributed, errors are normally
distributed, $e_i \sim N(0,\sigma^2)$, and independent, and $X$ is fixed, and
constant variance $\sigma^2$.

Penn State, STAT 504 Analysis of Discrete Data

Secondly, the linear regression analysis requires all variables to be
multivariate normal.

StatisticsSolutions, Assumptions of Linear Regression

This is appropriate when the response variable has a normal
distribution

Wikipedia, Generalized linear model

Is there a good explanation for how or why this misconception has spread? Is its origin known?

Best Answer

'Y must be normally distributed'

must?


In the cases you mention, the language is sloppy (an abbreviation of 'the error in Y must be normally distributed'), but the sources do not really (or at least not strongly) say that the response must be normally distributed. At least, it does not seem to me that their words were intended that way.

The Penn State course material

speaks about "a continuous variable $Y$", but also about "$Y_i$", as in $$E(Y_i) = \beta_0 + \beta_1 x_i$$ where we can regard $Y_i$ as normally distributed in the 'conditional' sense (as amoeba put it in the comments):

$$Y_i \sim N(\beta_0 + \beta_1x_i,\sigma^2)$$

The article uses $Y$ and $Y_i$ interchangeably. Throughout, it speaks of the 'distribution of Y', for instance:

  • when explaining some variant of GLM (binary logistic regression),

    Random component: The distribution of $Y$ is assumed to be $Binomial(n,\pi)$,...

  • in some definition

    Random Component – refers to the probability distribution of the response variable ($Y$); e.g. normal distribution for $Y$ in the linear regression, or binomial distribution for $Y$ in the binary logistic regression.

However, at another point they refer to $Y_i$ instead of $Y$:

  • The dependent variable $Y_i$ does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal,...)

The statisticssolutions webpage

is an extremely brief, simplified, stylized description. I am not sure you should take it seriously. For instance, it says

..requires all variables to be multivariate normal...

so the claim is not just about the response variable, and the 'multivariate' descriptor is also vague. I am not sure how to interpret it.

The wikipedia article

gives additional context in parentheses:

Ordinary linear regression predicts the expected value of a given unknown quantity (the response variable, a random variable) as a linear combination of a set of observed values (predictors). This implies that a constant change in a predictor leads to a constant change in the response variable (i.e. a linear-response model). This is appropriate when the response variable has a normal distribution (intuitively, when a response variable can vary essentially indefinitely in either direction with no fixed "zero value", or more generally for any quantity that only varies by a relatively small amount, e.g. human heights).

This 'no fixed zero value' seems to point to the fact that a value plus a normal error, $y+\epsilon$ with $\epsilon \sim N(0,\sigma^2)$, has an unbounded domain (from minus infinity to plus infinity), whereas many variables have a finite cut-off value (for example, counts cannot be negative).
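To make the point about the unbounded domain concrete, here is a small sketch of how much probability a normal model places below zero. The means and standard deviations are hypothetical numbers chosen only for illustration; they are not taken from the Wikipedia article.

```python
import math

def normal_cdf(value, mean, sd):
    """CDF of N(mean, sd^2) evaluated at `value`, via the error function."""
    return 0.5 * (1.0 + math.erf((value - mean) / (sd * math.sqrt(2.0))))

# Hypothetical numbers, chosen only for illustration.
# A count-like response with a small mean: noticeable mass below zero.
p_negative_count = normal_cdf(0.0, mean=2.0, sd=1.5)
# A height-like response (in cm): the spread is small relative to the mean.
p_negative_height = normal_cdf(0.0, mean=170.0, sd=10.0)

print(f"P(Y < 0), count-like:  {p_negative_count:.4f}")
print(f"P(Y < 0), height-like: {p_negative_height:.3e}")
```

Under these assumed values the normal model is harmless for the height-like quantity but assigns a non-negligible probability to impossible negative values for the count-like one, which is roughly the intuition the bracketed Wikipedia remark seems to be after.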

That particular line was added on March 8, 2012, but note that the first line of the Wikipedia article still reads "a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution", so the article is not so much (or not everywhere) wrong.


Conclusion

So, based on these three examples (which could indeed generate misconceptions, or at least be misunderstood), I would not say that "this misconception has spread". At least, it does not seem to me that the intention of those three examples is to argue that Y must be normally distributed. That said, I do remember this issue arising before here on Stack Exchange: the swap between normally distributed errors and a normally distributed response variable is easy to make.

So the assumption that 'Y must be normally distributed' seems to me not like a widespread belief or misconception (as in something that spreads like a red herring), but more like a common error (one that is not spread but made independently each time).


Additional comment

An example of the mistake on this website is the following question:

What if residuals are normally distributed, but y is not?

I would consider this a beginner's question. The mistake is not present in materials such as the Penn State course material, the Wikipedia article, or the book 'Extending the Linear Regression with R', which was recently mentioned in the comments.

The writers of those works do understand the material correctly. Indeed, they use phrases such as 'Y must be normally distributed', but from the context and the formulas used you can see that they all mean 'Y, conditional on X, must be normally distributed' and not 'the marginal Y must be normally distributed'. They do not hold the misconception themselves, and the idea is not widespread among statisticians and people who write books and other course materials. But misreading their ambiguous words may indeed cause the misconception.
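The difference between 'Y, conditional on X, is normal' and 'the marginal Y is normal' can be illustrated with a small simulation. This is only a sketch under assumed values: the coefficients, the error standard deviation, and the exponential distribution of the predictor are all made up for the example and do not come from any of the quoted sources.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Assumed illustrative values, not taken from any of the quoted sources.
beta0, beta1, sigma = 1.0, 2.0, 1.0

# A strongly right-skewed predictor (exponential with mean 3).
x = rng.exponential(scale=3.0, size=n)
# Errors are normal, so Y given x is normal by construction.
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)

# Ordinary least squares fit via numpy's least-squares solver.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

def skewness(v):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z**3))

print("estimated coefficients:", coef)
print("skewness of marginal y:", skewness(y))
print("skewness of residuals: ", skewness(residuals))
```

In this setup the marginal distribution of y inherits the skew of the predictor and is clearly non-normal, while the residuals are close to symmetric, as expected when only the conditional distribution is normal.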