Regression – What Does the i.i.d. Assumption of Errors in Linear Regression Imply for the Response Variable Y?

Tags: iid, machine-learning, regression

In the linear regression model we assume that the errors $ε_i$ are independent and identically distributed (i.i.d.) random variables. I am trying to understand what this assumption implies regarding the response variables $y_i$.

As far as the identically distributed assumption is concerned: since $y_i = X_ib+ε_i$ where $ε_i\sim \mathcal{N}(0,\,\sigma^{2})\,$, it follows that $y_i\sim \mathcal{N}(X_ib,\,\sigma^{2})\,$. So the response variables are not identically distributed, because they do not have the same mean.

My questions are the following:

  1. Can we say that the responses conditional on a given value of the explanatory variable, e.g. $y|X=5$, are identically distributed, since they follow the same distribution with the same mean and variance?
  2. Can we assume that $y$ are independent random variables or $y|X$ are independent random variables or neither of the two?
  3. In the machine learning context we assume that the data $(x_i,y_i)$ are i.i.d. What does this assumption imply about the random variables $y_i$? How is it related to the i.i.d. assumption of the errors in linear regression?
  4. Finally, regarding my opening statement that the errors are i.i.d. with $ε_i\sim \mathcal{N}(0,\,\sigma^{2})\,$: is that correct, or do we instead assume that the errors conditional on $X$ are i.i.d. with $ε_i|X_i\sim \mathcal{N}(0,\,\sigma^{2})\,$, or is it the same thing?

Best Answer

Remember that the error terms in the regression measure the deviations of the response variable from its conditional mean (given knowledge of the explanatory variables). Indeed, under the stipulated model form for regression, this is essentially the definition of what the error terms are. So you have:

$$\varepsilon_i \equiv Y_i - \mathbb{E}(Y_i|X_i).$$

Observe that each error term is a function of both the response variable and the explanatory variable for that data point. Now, if the error terms are IID, this means that the deviations-from-the-conditional-mean are independent and identically distributed. This does not make the response variable IID, except in the trivial case where the explanatory variable has zero variance (i.e., it has a point-mass distribution).
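A quick simulation makes the distinction concrete. This is a minimal sketch (the coefficient, noise level, and two-point design are illustrative assumptions, not from the question): the errors are drawn IID $\mathcal{N}(0,\sigma^2)$, yet the responses inherit different means from the fixed explanatory values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep = 20000                       # replications of the whole sample
b, sigma = 2.0, 1.0                 # illustrative slope and error SD
x = np.array([1.0, 5.0])            # a fixed design with two data points

# errors: IID N(0, sigma^2), the same distribution for every i
eps = rng.normal(0.0, sigma, size=(n_rep, 2))
y = x * b + eps                     # y_i = x_i * b + eps_i

print(eps.mean(axis=0))  # both error means are near 0: identical distribution
print(y.mean(axis=0))    # response means near [2, 10]: not identical
```

The two columns of `eps` are indistinguishable draws from one distribution, while the two columns of `y` are centred at $x_i b$ and so differ whenever the $x_i$ differ.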

With regard to your specific questions, the answers are as follows:

  1. No. As you correctly point out, under the regression assumptions there is no common distribution for the response variable (and no common mean either). Other than in the trivial case where the explanatory variable has zero variance (i.e., it has a point-mass distribution) the response variable has a conditional mean that depends on the explanatory variable.

  2. The latter. The response variables are conditionally independent conditional on the explanatory variable. As a shorthand we may say that the values $Y_i|X_i$ are independent (though not identically distributed).

  3. This is a much stronger assumption than in regression analysis. It is equivalent to making the standard regression assumptions, but also assuming that the underlying explanatory variables are IID. Once you assume that $X_1,...,X_n \sim \text{IID}$, the regression assumptions imply that the response variable is also marginally IID, which gives the joint IID result.
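To illustrate this point, here is a small sketch under assumed illustrative parameters: once the $X_i$ are themselves drawn IID, each pair $(x_i, y_i)$ is a draw from the same joint distribution, so every $y_i$ also has the same marginal mean $\mathbb{E}(X)\,b$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rep, n = 20000, 3                 # replications, sample size per replication
b, sigma = 2.0, 1.0                 # illustrative slope and error SD

x = rng.normal(3.0, 1.0, size=(n_rep, n))     # X_i now IID N(3, 1)
eps = rng.normal(0.0, sigma, size=(n_rep, n)) # errors IID N(0, sigma^2)
y = x * b + eps

# each column is one y_i across replications; with random IID X_i,
# every marginal mean is near E[X] * b = 6, unlike the fixed-design case
print(y.mean(axis=0))
```

Contrast this with the fixed-design case: there the mean of $y_i$ depends on the particular $x_i$, whereas here marginalising over the IID $X_i$ gives every $y_i$ the same (mixture) distribution.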

  4. In regression analysis, all your distributional assumptions are conditional on the explanatory variables, so the actual assumption is that:

$$\varepsilon_1,...,\varepsilon_n | \mathbf{X} \sim \text{IID N}(0,\sigma^2).$$