Conditional independence of the response variable, regression analysis

generalized linear model, linear model, regression

The book says: for the standard linear regression model, we assume:

\begin{equation}
y_i= \beta_0 + \beta_1 x_i + \epsilon_i
\end{equation}

where $E[\epsilon_i]=0$ and $Var[\epsilon_i]=\sigma^2$.
Homoskedasticity means that the error variance is constant, i.e. $Var[\epsilon_i] = \sigma^2$ does not depend on the covariates.
For constructing confidence intervals and hypothesis tests, we assume $\epsilon_i \sim N(0,\sigma^2)$.
In this case, the observations of the response variable follow a (conditional) normal
distribution with
\begin{equation}
E[y_i] = \beta_0 + \beta_1 x_i, \qquad Var[y_i] = \sigma^2,
\end{equation}
and the $y_i$ are (conditionally) independent given the covariate values $x_i$.
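
To make these assumptions concrete, here is a minimal simulation sketch in Python/NumPy (the values of $\beta_0$, $\beta_1$, and $\sigma$ are arbitrary choices for illustration): conditional on the observed $x_i$, each $y_i$ is an independent draw from $N(\beta_0 + \beta_1 x_i, \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" parameters, chosen only for illustration
beta0, beta1, sigma = 1.0, 2.0, 0.5

# Covariate values, treated as given
x = rng.uniform(0, 10, size=200)

# Conditional on x, each y_i is an independent draw from N(beta0 + beta1*x_i, sigma^2)
eps = rng.normal(0.0, sigma, size=x.shape)   # i.i.d. errors: E[eps]=0, Var[eps]=sigma^2
y = beta0 + beta1 * x + eps

# OLS estimates of (beta0, beta1) via least squares on the design matrix [1, x]
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)  # should be close to (1.0, 2.0)
```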

Why are the $y_i$ conditionally independent of the covariates $x_i$?
Linear regression should be about estimating the expected value of $y_i$ conditional on a sequence of covariates. How can $y_i$ be conditionally independent of $x_i$?

Best Answer

While this is not contained in the part you quote, my guess would be that your book operates under the setup where the regressors are considered to be fixed.

In the setup of this question, that does not change too much. Essentially, you would replace all expectations in your question with expectations conditional on $x_i$.

That said, the idea that the values of the regressors are fixed in repeated samples is generally considered to be somewhat restrictive, as, typically, the regressors are random to the investigator just as the dependent variable is. Exceptions may include experimental setups where the investigator can precisely decide upon, e.g., a dose.
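
One way to picture the fixed-regressors setup is the following repeated-sampling sketch (hypothetical numbers): the same $x_i$ are reused in every sample, only the errors are redrawn, and the sampling distribution of the OLS slope is centred on $\beta_1$ with variance $\sigma^2 / \sum_i (x_i - \bar{x})^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 2.0, 0.5

# Fixed design: the same x values are reused in every repeated sample
x = np.linspace(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])

slopes = []
for _ in range(5000):
    # Only the errors (and hence y) are random from sample to sample
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.shape)
    slopes.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

# Empirical mean and variance of the slope estimates ...
print(np.mean(slopes), np.var(slopes))
# ... versus the textbook formula sigma^2 / sum((x_i - x_bar)^2)
print(sigma**2 / np.sum((x - x.mean())**2))
```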

More importantly, however, the fixed-regressors idea becomes more restrictive once you start to ask causal questions (think omitted variable bias, instrumental variable approaches, etc.). These are motivated by correlation of the regressor with the error term, which is not a very natural consideration when the regressors are taken to be fixed constants.
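
A quick sketch of why that correlation matters (again with hypothetical numbers): if a variable $z$ that affects $y$ is omitted from the regression and $x$ is correlated with $z$, then $z$ ends up in the error term of the short regression, that error term is correlated with $x$, and the OLS slope on $x$ is biased.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# z affects y but is omitted from the short regression; x is correlated with z
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

X_short = np.column_stack([np.ones(n), x])      # omits z
X_long = np.column_stack([np.ones(n), x, z])    # includes z

print(np.linalg.lstsq(X_short, y, rcond=None)[0][1])  # biased, noticeably above 2
print(np.linalg.lstsq(X_long, y, rcond=None)[0][1])   # close to the true slope 2
```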
