If $Y_1,\ldots,Y_n\sim \text{i.i.d.} \operatorname N(\mu,\sigma^2)$ then the sample mean $(Y_1+\cdots+Y_n)/n$ is both the least-squares estimator of $\mu$ and the maximum-likelihood estimator of $\mu.$
It is also the best linear unbiased estimator of $\mu,$ i.e.
- it is a linear combination of $Y_1,\ldots,Y_n,$ and
- it is unbiased in the sense that its expected value remains equal to $\mu$ if $\mu$ changes, and
- it is best in the sense that its variance is no larger than that of any other estimator satisfying the two conditions above.
- It is in fact better than all other unbiased estimators of $\mu,$ linear or not. For example, the sample median is an unbiased estimator of $\mu$ that is not a linear combination of $Y_1,\ldots,Y_n,$ and its variance is larger than that of the sample mean (see the simulation sketch just after this list). The fact that the sample mean beats all other unbiased estimators lies at the same depth as the one-to-one nature of the two-sided Laplace transform.
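A small simulation can illustrate the point about the sample median: under normal data both estimators are unbiased for $\mu,$ but the median's variance is larger (asymptotically by a factor of $\pi/2$). This is a minimal sketch only; the parameter values, sample size, and seed below are arbitrary choices, not part of the original discussion.

```python
# Sketch: compare the sample mean and sample median as estimators of mu
# under i.i.d. normal data. Both are (roughly) unbiased; the median's
# variance is larger.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 25, 20_000

samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print("mean of sample means:   ", means.mean())    # ~ mu
print("mean of sample medians: ", medians.mean())  # ~ mu (also unbiased)
print("var of sample means:    ", means.var())     # ~ sigma^2 / n
print("var of sample medians:  ", medians.var())   # larger, ~ (pi/2) * sigma^2 / n
```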
The same thing applies to more elaborate sorts of linear models. For example, suppose we have
$$
\text{independent } Y_i \sim \operatorname N(a+bx_i, \sigma^2) \text{ for } i=1,\ldots,n.
$$
Then the least-squares estimators of $a$ and $b$ are likewise BLUE.
In the situations above, least-squares estimation of $\mu$ or $(a,b)$ coincides with maximum-likelihood estimation.
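As a concrete sketch of that model, here is a least-squares fit of $a$ and $b$; under the normal model above this is also the maximum-likelihood estimate, since maximizing the normal likelihood amounts to minimizing the sum of squared deviations. The true values, noise level, and seed are illustrative assumptions, not taken from the original.

```python
# Sketch: least-squares (= ML under normal errors) estimation of (a, b)
# in the model Y_i ~ N(a + b*x_i, sigma^2).
import numpy as np

rng = np.random.default_rng(1)
a_true, b_true, sigma, n = 1.0, 0.5, 1.0, 100
x = np.linspace(0, 10, n)
y = a_true + b_true * x + rng.normal(0, sigma, size=n)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("least-squares (= ML) estimates of (a, b):", beta_hat)
```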
The assertions in the bulleted list above, except for the fourth bullet point, can be proved with far weaker assumptions than normality of the $Y\text{s}.$ It is enough to assume that
- $Y_1,\ldots,Y_n$ all have expected value $\mu,$ or that they have respective expected values $a+bx_i,$ and
- $Y_1,\ldots,Y_n$ all have the same variance (not necessarily the same distribution), and
- $Y_1,\ldots, Y_n$ are uncorrelated (not necessarily independent).
The Gauss–Markov theorem says that these three assumptions are enough to guarantee that least-squares is BLUE.
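The following sketch illustrates the Gauss–Markov conditions without normality: the errors below are centered exponential (mean zero, common variance, uncorrelated, decidedly non-normal). The sample mean and an unequally weighted average are both linear and unbiased for $\mu,$ but the sample mean (the least-squares estimator) has the smaller variance. The specific error distribution, weights, and seed are assumptions made for illustration.

```python
# Sketch: Gauss-Markov in action without normality. Compare the sample mean
# with another linear unbiased estimator (weights summing to 1).
import numpy as np

rng = np.random.default_rng(2)
mu, n, reps = 3.0, 20, 20_000

# non-normal errors with mean 0 and common variance 1, uncorrelated
errors = rng.exponential(1.0, size=(reps, n)) - 1.0
Y = mu + errors

w = np.linspace(1, 2, n)
w = w / w.sum()                       # weights sum to 1 => linear and unbiased

ols = Y.mean(axis=1)                  # equal-weight (least-squares) estimator
other = Y @ w                         # a different linear unbiased estimator

print("biases:   ", ols.mean() - mu, other.mean() - mu)   # both ~ 0
print("variances:", ols.var(), other.var())               # least-squares is smaller
```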
But with these weaker Gauss–Markov assumptions, it makes no sense to speak of maximum likelihood, since we don't have a parametrized family of probability distributions.
If $E[Y\mid X=x]$ is linear, e.g. $\beta_0 + \beta_1x,$ then $\hat{\beta}_0 + \hat{\beta}_1x$ is still unbiased and has minimal variance among linear unbiased predictors. However, if $E[Y\mid X=x] = g(\beta; x)$ is nonlinear, then $g(\hat{\beta}; x)$ is generally biased (by Jensen's inequality when $g$ is convex or concave in $\beta$), and hence is not a minimum-variance unbiased estimator.
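A minimal sketch of the Jensen point: even though the sample mean is unbiased for $\mu,$ plugging it into a convex function $g$ (here $g(t)=e^t,$ an assumed choice for illustration) gives an upward-biased estimator of $g(\mu).$

```python
# Sketch: exp(sample mean) is biased upward for exp(mu), since exp is convex
# (Jensen's inequality). Here E[exp(Ybar)] = exp(mu + sigma^2 / (2n)) > exp(mu).
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 0.0, 1.0, 10, 50_000

Y = rng.normal(mu, sigma, size=(reps, n))
g_of_mean = np.exp(Y.mean(axis=1))

print("target g(mu) = exp(mu):   ", np.exp(mu))         # 1.0
print("mean of exp(sample mean): ", g_of_mean.mean())   # > 1, biased upward
```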
Best Answer
Two things:
You are confusing residuals and errors; they are not the same. The former are the quantities $\hat{e} = y - X \hat{\beta}$ and, in particular, they are estimated, not observed. The errors are the random component of your model, about which you have to make some assumptions. Notice that these assumptions need not be met by your residuals: for instance, one may assume the errors are independent, yet OLS residuals from a model with an intercept sum to zero, showing they are not independent.
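Here is a minimal sketch of that last remark, on arbitrary simulated data: the OLS residuals from a model with an intercept sum to (numerically) zero, even though the underlying errors were generated independently.

```python
# Sketch: OLS residuals sum to ~0 when the model includes an intercept,
# so they cannot be independent even if the true errors are.
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.7 * x + rng.normal(0, 1, n)   # errors drawn independently

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

print("sum of residuals:", residuals.sum())   # ~ 0 up to floating-point error
```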
Setting the mean of the errors to zero is done to ensure the model is identifiable. Notice that if I consider the model $y = \beta_0 + \beta_1 x + \epsilon$ with $E(\epsilon) = \mu$, then this model cannot be distinguished from $y = \beta_0^\prime + \beta_1x + \epsilon^\prime$ where $\beta_0^\prime = \beta_0 + \mu$ and $E(\epsilon^\prime) = 0$. Without this restriction, parameter estimation makes little sense (and hence you would have no Gauss–Markov theorem).
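A short sketch of this non-identifiability, with illustrative values chosen for the example: data generated with $(\beta_0 = 1,\ E(\epsilon) = 2)$ are distributed exactly like data generated with $(\beta_0^\prime = 3,\ E(\epsilon^\prime) = 0)$, so only the sum $\beta_0 + \mu$ can be estimated.

```python
# Sketch: two parameterizations that generate identically distributed data.
# OLS recovers beta_0 + mu (= 3) in both cases; beta_0 and mu are not
# separately identifiable.
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
x = rng.uniform(0, 10, n)

y1 = 1.0 + 0.5 * x + rng.normal(2.0, 1.0, n)   # beta_0 = 1, E(eps) = 2
y2 = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, n)   # beta_0' = 3, E(eps') = 0

X = np.column_stack([np.ones(n), x])
b1, *_ = np.linalg.lstsq(X, y1, rcond=None)
b2, *_ = np.linalg.lstsq(X, y2, rcond=None)
print("fitted intercepts:", b1[0], b2[0])   # both ~ 3: the models are indistinguishable
```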