Solved – Regression residual distribution assumptions

assumptions, normal distribution, notation, regression, residuals

Why is it necessary to place the distributional assumption on the errors, i.e.

$y_i = x_i^\top\beta + \epsilon_{i}$, with $\epsilon_{i} \sim \mathcal{N}(0,\sigma^{2})$.

Why not write

$y_i = x_i^\top\beta + \epsilon_{i}$, with $y_i \sim \mathcal{N}(x_i^\top\hat{\beta},\sigma^{2})$,

where in either case $\epsilon_i = y_i - \hat{y}_i$.
I've seen it stressed that the distributional assumptions are placed on the errors, not on the data, but without explanation.

I don't really understand the difference between these two formulations. In some places I see distributional assumptions placed on the data (mostly in the Bayesian literature, it seems), but most of the time the assumptions are placed on the errors.

When modelling, why would or should one choose to begin with assumptions on one rather than the other?

Best Answer

In a linear regression setting it is common to do analysis and derive results conditional on $X$, i.e. conditional on "the data". Thus, what you need is that $y \mid X$ is normal, that is, you need $\epsilon$ to be normal. As Peter Flom's example illustrates, one can have normality of $\epsilon$ without normality of $y$: if $X$ itself is not normal (say, a group indicator), the marginal distribution of $y$ mixes over the values of $X$ and need not be normal at all. Since what you actually need is normality of $\epsilon$, that is the sensible assumption.
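A quick simulation makes this concrete (a sketch in the spirit of the answer; the binary regressor and parameter values are my own illustrative choices, not from the original example). With a 0/1 regressor and a large group difference, the marginal distribution of $y$ is bimodal, yet the errors, and hence the residuals, are normal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: binary regressor, normal errors.
x = rng.integers(0, 2, size=n)          # group indicator, clearly not normal
beta0, beta1, sigma = 0.0, 10.0, 1.0    # assumed true parameters
eps = rng.normal(0.0, sigma, size=n)    # N(0, sigma^2) errors
y = beta0 + beta1 * x + eps             # marginally a 50/50 mixture of
                                        # N(0,1) and N(10,1): bimodal

# OLS fit of y on [1, x]; residuals estimate eps.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

def excess_kurtosis(v):
    """Excess kurtosis: 0 for a normal; strongly negative for this mixture."""
    z = (v - v.mean()) / v.std()
    return (z ** 4).mean() - 3.0

# y is far from normal, while the residuals look normal.
print("excess kurtosis of y:     ", round(excess_kurtosis(y), 2))
print("excess kurtosis of resid: ", round(excess_kurtosis(resid), 2))
```

The point of the design choice: conditional on $X$ (i.e. within each group), $y$ is exactly normal, which is all the standard inference requires; the non-normal shape of the marginal $y$ comes entirely from the distribution of $X$.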
