Linear regression model without error term

least squares, linear regression

I have this linear model from a regression:

$$Y_i = \beta_1 X_{i1} + \dots + \beta_m X_{im} + \epsilon_i$$

The matrix representation is:

$$Y = X\beta + \epsilon$$

In a lot of places, like Wikipedia, they say that $Y = X\beta$ is an overdetermined system (in fact it is) and then they apply least squares.

My question is: why are they trying to solve $Y = X\beta$? The original system was $Y = X\beta + \epsilon$. Why do they ignore the error term $\epsilon$?

My guess is that $Y$ is the real value and they are not trying to solve $Y = X\beta$ but $\hat{Y} = X\beta$, where $\hat{Y} = Y - \epsilon$ is the observed value, but I couldn't find this in any book or other trustworthy source, so maybe I'm wrong.

Thanks.

Best Answer

Let's suppose we run some experiment with $m$ experimental conditions $n$ times. $Y_i$ is the outcome of the $i$th experiment and $X_{i1},\dots,X_{im}$ is the list of experimental conditions of the $i$th experiment. Let's write $X_i = (X_{i1}, \dots, X_{im})$. Then the data we observe is $(Y_i, X_i),\,i=1,\dots,n$. Note that we do observe the true experimental outcome and the true experimental conditions.

Given our data we can ask: How well can our experimental outcome be described as a linear function of the experimental conditions? We can phrase this question as: How close can we get to solving the following system of $n$ equations? $$Y_i = X_i\tilde \beta, \quad i=1,\dots,n.$$ In matrix notation, the system is $$Y = X \tilde\beta, \tag{1}$$ where $Y=(Y_1,\dots,Y_n)^T$ and $X$ is the matrix whose $i$th row is $X_i$. Note that $(1)$ is exactly the system of equations you are wondering about.
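Written out row by row, $(1)$ is a system of $n$ linear equations in the $m$ unknowns $\tilde\beta_1,\dots,\tilde\beta_m$, which has more equations than unknowns (overdetermined) whenever $n > m$: $$\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} X_{11} & \cdots & X_{1m} \\ \vdots & & \vdots \\ X_{n1} & \cdots & X_{nm} \end{pmatrix} \begin{pmatrix} \tilde\beta_1 \\ \vdots \\ \tilde\beta_m \end{pmatrix}.$$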

If we can find a solution $\beta$ to $(1)$, then all is well. Usually this will not be the case, however. Instead, we can try to find an approximate solution: a parameter vector that does not exactly solve $(1)$ but gets "close" to solving it. One way of measuring how close some parameter vector $\beta$ comes to solving $(1)$ is to define residuals $$\varepsilon_i = Y_i - X_i\beta.\tag{2}$$ Then by construction $Y_i = X_i\beta + \varepsilon_i$ holds for all $i$. Note that $\beta$ is a solution to $(1)$ if and only if $\varepsilon_i = 0$ holds for all $i$. Intuitively, $\beta$ is close to solving $(1)$ if the $\varepsilon_i$ are "close to zero". One way of measuring this closeness is by the sum of squared residuals $$\varepsilon_1^2 + \dots + \varepsilon_n^2,$$ where $\varepsilon_i$ is defined by $(2)$. The smaller the sum of squared residuals, the closer $\beta$ gets to being a solution to $(1)$. The parameter vector achieving the smallest sum of squared residuals is precisely the ordinary least squares estimator $$\hat \beta = (X^TX)^{-1}X^TY.$$
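For concreteness, here is a minimal numerical sketch of this computation in Python/NumPy. The data and the helper `ssr` are made up for illustration; it computes $\hat\beta$ from the formula above and checks that a generic least-squares solver agrees.

```python
import numpy as np

# Made-up data: n = 50 observations, m = 3 experimental conditions.
rng = np.random.default_rng(0)
n, m = 50, 3
X = rng.normal(size=(n, m))
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Ordinary least squares: beta_hat = (X^T X)^{-1} X^T Y,
# computed by solving the normal equations (X^T X) beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# A generic least-squares solver for the overdetermined system Y = X beta
# returns the same parameter vector.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

def ssr(beta):
    """Sum of squared residuals of a candidate parameter vector beta."""
    resid = Y - X @ beta
    return resid @ resid

# beta_hat makes the sum of squared residuals at least as small as
# that of any other candidate, e.g. a randomly perturbed one.
assert ssr(beta_hat) <= ssr(beta_hat + rng.normal(scale=0.1, size=m))
```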

Given $\hat \beta$ as an approximate solution to $(1)$ we can define $\hat Y_i = X_i\hat \beta$ and $$\hat \varepsilon_i = Y_i - X_i\hat \beta = Y_i - \hat Y_i.$$ Here, $\hat \varepsilon_i$ measures how close $\hat \beta$ gets to solving the $i$th equation in $(1)$.
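In code (again a sketch with hypothetical numbers, this time a tiny data set), the fitted values and residuals are simply:

```python
import numpy as np

# Hypothetical data: n = 4 observations, m = 2 experimental conditions,
# so Y = X beta has more equations than unknowns and in general no exact solution.
X = np.array([[1.0,  2.0],
              [1.0,  0.5],
              [1.0, -1.0],
              [1.0,  3.0]])
Y = np.array([3.1, 1.4, -0.8, 4.2])

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # least-squares estimate

Y_hat = X @ beta_hat   # fitted values:  Y_hat_i  = X_i beta_hat
eps_hat = Y - Y_hat    # residuals:      eps_hat_i = Y_i - Y_hat_i

# By construction Y_i = X_i beta_hat + eps_hat_i holds exactly for every i,
# even though eps_hat is generally nonzero.
assert np.allclose(Y, Y_hat + eps_hat)
```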

There are other ways of motivating ordinary least squares but if you are wondering what role the system $Y = X \tilde\beta$ plays then in my opinion the "approximate solution to a system of equations" approach is the one to think about. One nice aspect of this approach is that it shows that linear regression can be motivated without any reference to randomness.

See here for another explanation of this approach. For a slightly different motivation of linear regression see e.g. pages 44-45 here.
