[Math] Question about the objective function of Linear regression

linear-regression, statistical-inference, statistics

Suppose we have some data that I am fitting with a simple linear regression model. As the graph shown below, the black line represents the true model that generated the data. Denote it as:
$$y = \beta_0 + \beta_1 x_1 + \epsilon$$
where $\epsilon$ is normally distributed with mean $0$ and variance $\sigma^2$. The red line represents the estimated model, denoted as:
$$\hat y = \hat\beta_0 + \hat\beta_1 x_1$$
Let the residuals be denoted by $\hat\epsilon$. The objective of linear regression is to minimize the sum of squared residuals $\sum_{i=1}^n \hat\epsilon_i^2$, so that we can find an estimated line that is close to the true model. However, intuitively, in order to find an estimated line that is as close as possible to the true line, we only need to minimize the distance between the true line and the estimated line, that is, $|\hat\epsilon_i - \epsilon_i|$ (as the graph below shows). This leads to the new objective function $\min \sum_i(\hat\epsilon_i - \epsilon_i)^2$.
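(To spell out why $\hat\epsilon_i - \epsilon_i$ measures the gap between the two lines: since $\hat\epsilon_i = y_i - \hat y_i$ and $\epsilon_i = y_i - (\beta_0 + \beta_1 x_i)$, we have
$$\hat\epsilon_i - \epsilon_i = (\beta_0 + \beta_1 x_i) - (\hat\beta_0 + \hat\beta_1 x_i),$$
which is exactly the vertical distance between the true line and the estimated line at $x_i$.)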

What confuses me the most is that the least squares method tries to fit an estimated model that is as close as possible to all the observations, not to the real model. However, an estimated model that is close to the observations is not guaranteed to also be close to the real model, since an observation with a large error term will pull the estimated line away from the true line. For this reason, the objective function $\min \sum_i(\hat\epsilon_i - \epsilon_i)^2$ makes more sense to me, even though in practice we cannot take it as our objective function, since $\epsilon$ is unknown.

My question is then: why do we use $\min \sum_i \hat\epsilon_i^2$ as our objective function if it is not guaranteed to produce a model that is close to the true model? Are $\min \sum_i \hat\epsilon_i^2$ and $\min \sum_i(\hat\epsilon_i - \epsilon_i)^2$ equivalent to each other (or could one lead to the other)?
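To make the two objectives concrete, here is a minimal simulation sketch in Python (the true coefficients, sample size, and noise level are made up for illustration). In a simulation the true $\epsilon_i$ are known, so both $\sum_i \hat\epsilon_i^2$ and $\sum_i(\hat\epsilon_i - \epsilon_i)^2$ can be computed for the least squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# True model: y = 1.0 + 2.0*x + eps, eps ~ N(0, 1)   (illustrative values)
beta0, beta1, sigma = 1.0, 2.0, 1.0
n = 50
x = rng.uniform(0, 10, n)
eps = rng.normal(0, sigma, n)
y = beta0 + beta1 * x + eps

# Ordinary least squares fit
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

resid_hat = y - y_hat                   # estimated residuals eps_hat
ssr = np.sum(resid_hat ** 2)            # sum eps_hat^2  (what OLS minimizes)
gap = np.sum((resid_hat - eps) ** 2)    # sum (eps_hat - eps)^2  (line-to-line gap)

print("beta_hat =", beta_hat)
print("sum eps_hat^2       =", ssr)
print("sum (eps_hat-eps)^2 =", gap)
```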

Any help would be appreciated. Thanks in advance.

[Figure: scatter plot of the data with the true line (black) and the estimated line (red)]

Best Answer

Comments:

(a) In some applications one minimizes $D = \sum_i |\hat e_i|$ instead of $Q = \sum_i \hat e_i^2.$ An advantage of $D$ (my notation) is that it puts less emphasis on points far from the line produced by the data. (But one usually pays due attention to points far from the usual line made using $Q;$ this is part of 'regression diagnostics'.) Advantages of using $Q$ are computational simplicity and the existence of standard distributions to use in testing and making CIs.
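As a rough illustration of the difference between $D$ and $Q$ (a sketch only; the data, the planted outlier, and the use of scipy.optimize.minimize for the absolute-deviation fit are my own choices, not part of the original comment):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Data from y = 1 + 2x with one gross outlier (illustrative)
x = np.linspace(0, 10, 30)
y = 1 + 2 * x + rng.normal(0, 1, x.size)
y[-1] += 25                               # outlier far from the line

X = np.column_stack([np.ones_like(x), x])

# Q: least squares fit (minimizes the sum of squared residuals)
beta_Q = np.linalg.lstsq(X, y, rcond=None)[0]

# D: least absolute deviations fit (minimizes the sum of |residuals|)
beta_D = minimize(lambda b: np.sum(np.abs(y - X @ b)),
                  x0=beta_Q, method="Nelder-Mead").x

print("least squares (Q):", beta_Q)   # pulled noticeably toward the outlier
print("least abs dev (D):", beta_D)   # closer to the true intercept/slope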

(b) As mentioned by @littleO, expressions involving $\epsilon_i$ are off the table because the $\epsilon_i$ are not known.

(c) As for general 'objectives' of regression, I immediately think of two: First, prediction of $y$-values from new $x$-values (not used in making the line). Second, understanding relationships among variables: either to verify known theoretical relationships as holding true in practice or to discover new relationships.

Note: Recent hand surgery has reduced me to hunt-and-peck typing for a few days, and probably to making even more typos than usual. If you have rep to fix them, please feel free.
