Solved – where the error term in the target variable comes from in linear regression

machine-learning, mathematical-statistics, prediction, probability, regression

For linear regression, one assumption is that the target variable Y has an underlying linear relationship with the features (X1, X2, . . . , Xd), modified by an error term ε that follows a zero-mean Gaussian distribution. I do not understand where the error term comes from. If Y IS the target/true label, how can there be error in it? Is it introduced by noise in the observations?

Or does it mean that the relationship between Y and the features is not exactly linear, so the linearity assumption itself introduces some error?

Best Answer

The classic linear regression model is:

$$ y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_k x_{i,k} + \epsilon_i$$

The error term captures everything else that's going on besides a linear relationship with $x_1$ through $x_k$! An entirely equivalent way to write the linear model, which may be instructive, is:

$$ \epsilon_i = y_i - \left(\beta_0 + \beta_1 x_{i,1} + \ldots + \beta_k x_{i,k}\right) $$

From this, you can get a sense of where linear regression can go wrong. If $\epsilon_i$ has stuff going on such that $\mathrm{E}\left[\epsilon_i \mid X \right] \neq 0$, then strict exogeneity is violated and the regressors and the error term are no longer orthogonal. (Orthogonality of the regressors and the error term is what gives rise to the normal equations, and hence to the OLS estimator $\hat{\mathbf{b}} = \left(X'X\right)^{-1} X'\mathbf{y}$.)
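To make that concrete, here's a minimal NumPy sketch (the simulated data and all names are invented for illustration): generate data from the model above with a known $\beta$ and a well-behaved Gaussian $\epsilon$, recover $\hat{\mathbf{b}}$ via the normal equations, and check that the residuals are orthogonal to the regressors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the classic linear model: y = b0 + b1*x1 + b2*x2 + eps,
# with eps ~ N(0, 1) and E[eps | X] = 0 by construction.
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# OLS via the normal equations: solve (X'X) b_hat = X'y.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)  # close to [1.0, 2.0, -0.5]

# The residuals are the sample analogue of eps, and they are
# orthogonal to every column of X (up to floating-point error).
resid = y - X @ b_hat
print(X.T @ resid)  # ~ zeros
```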

Think of the error term as a garbage collection term, a term that collects EVERYTHING ELSE that's going on besides a linear relationship between $y_i$ and your observed regressors $x_1, \ldots, x_k$. What could end up in the error term is limitless. Of course, what's allowed into the error term for OLS to be a consistent estimator isn't limitless :P.
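Here's a hedged sketch of how that can break (again with invented names and simulated data): a relevant variable z is left out of the regression, so it falls into the error term. Because z is correlated with the observed regressor, $\mathrm{E}\left[\epsilon_i \mid X\right] \neq 0$ and OLS is biased.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# x1 is observed; z is a relevant variable we fail to include.
x1 = rng.normal(size=n)
z = 0.8 * x1 + rng.normal(size=n)              # correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * z + rng.normal(size=n)

# Regress y on a constant and x1 only: z is swept into the error
# term, so E[eps | x1] != 0 and the slope estimate is biased.
X = np.column_stack([np.ones(n), x1])
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)  # slope ~ 2.0 + 1.5 * 0.8 = 3.2, not the true 2.0
```

The slope drifts to roughly $2.0 + 1.5 \times 0.8 = 3.2$: exactly the omitted-variable bias you'd predict from the covariance between x1 and the stuff hiding in the error term.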
