Solved – Distribution of Y influenced by predictor X in simple linear regression

linear model, regression

One of the assumptions in simple linear regression is that the error term is normally distributed. Now, I found the following quote on the internet:

"You’ll notice there is nothing similar about Y. ε’s distribution is influenced by Y’s, which is why Y has to be continuous, unbounded, and measured on an interval or ratio scale.

But Y’s distribution is also influenced by the X’s. ε’s isn’t. That’s why you can get a normal distribution for ε, but lopsided, chunky, or just plain weird-looking Y."

My question is: is this true? I actually thought it was the other way around, that the distribution of Y is NOT influenced by the predictors.

Best Answer

The quote the OP links to starts with a mistake: it refers to the "residuals", while all these assumptions refer to the errors (the residuals are the estimated errors).
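
As a quick illustration of that distinction, here is a minimal simulation sketch in Python (with made-up parameter values, not from the original post): the errors $u_i$ are the unobserved draws in the data-generating process, while the residuals are what an OLS fit estimates them to be.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data-generating process: Y = a + b*X + u
n = 200
a, b = 1.0, 2.0
X = rng.uniform(0.0, 5.0, size=n)
u = rng.normal(0.0, 1.0, size=n)      # errors: unobservable in practice
Y = a + b * X + u

# Fit by ordinary least squares (np.polyfit returns [slope, intercept])
b_hat, a_hat = np.polyfit(X, Y, deg=1)
residuals = Y - (a_hat + b_hat * X)   # residuals: the estimated errors

print("corr(errors, residuals):", np.corrcoef(u, residuals)[0, 1])
```

The correlation is close to, but not exactly, one: the residuals only approximate the errors, which is why the assumptions are stated about the errors themselves.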

Apart from that, when we specify a regression equation, we state that, as a random variable, $Y$ is a function of the $X$'s and of the error term. It is then natural to say that the distribution of $Y$ will be influenced by the distributions of the $X$'s and of the error term, since they determine $Y$ itself.

As a simple example, assume that $Y = a + bX + u$, where $u$ follows a Normal distribution but $X$ follows, say, a Gamma distribution. Then the distribution of $Y$ cannot be normal, and what it actually is will depend also on the distribution of $X$ and on how it "mingles" with the distribution of $u$.
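
To make this concrete, here is a minimal simulation sketch in Python (the parameter values are made up for illustration): $u$ is exactly normal, $X$ is Gamma-distributed, and the resulting $Y$ is clearly right-skewed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Illustrative values for Y = a + b*X + u, with Gamma X and Normal u
n = 100_000
a, b = 1.0, 2.0
X = rng.gamma(shape=2.0, scale=1.0, size=n)  # right-skewed regressor
u = rng.normal(loc=0.0, scale=1.0, size=n)   # exactly normal error term
Y = a + b * X + u

print("skewness of u:", stats.skew(u))  # approximately 0: errors look normal
print("skewness of Y:", stats.skew(Y))  # clearly positive: Y inherits X's skew
```

When $X$ and $u$ are independent, the marginal distribution of $Y$ is the convolution of the distributions of $bX$ and $u$ (shifted by $a$), so any non-normal shape in $X$ carries over to $Y$.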

Even if the regressors are "deterministic", meaning that they cannot be said to follow a statistical distribution, they still affect the parameters of the distribution of $Y$: in the previous example with a deterministic regressor, the distribution of $Y$ will be normal with a mean shifted by the regressor (but the same variance as $u$).
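
Spelled out for the same example, with a fixed (non-random) regressor value $x_i$ for observation $i$:

$$Y_i = a + b x_i + u_i, \qquad u_i \sim N(0,\sigma^2) \quad\Longrightarrow\quad Y_i \sim N(a + b x_i,\ \sigma^2),$$

so each $Y_i$ is normal, but its mean moves with $x_i$ while its variance stays equal to that of $u_i$.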

In the "conditional expectation function" approach, in principle we consider the joint distribution of $\{Y,X\}$ and the resulting conditional one, and the distribution of the conditional expectation function error springs from these (i.e. here the error is not treated as a separate variable but is defined as $u\equiv Y- E(Y\mid X)$ )

So in all cases, the distribution of $Y$ is influenced by $X$, in one way or another.
