The question is very simple: why, when we try to fit a model to our data, linear or non-linear, do we usually try to minimize the sum of the squares of errors to obtain our estimator for the model parameter? Why not choose some other objective function to minimize? I understand that, for technical reasons, the quadratic function is nicer than some other functions, e.g., sum of absolute deviation. But this is still not a very convincing answer. Other than this technical reason, why in particular are people in favor of this 'Euclidean type' of distance function? Is there a specific meaning or interpretation for that?
The logic behind my thinking is the following:
When you have a dataset, you first set up your model by making a set of functional or distributional assumptions (say, some moment condition but not the entire distribution). In your model, there are some parameters (assume it is a parametric model), so you need to find a way to consistently estimate these parameters, and hopefully your estimator will have low variance and some other nice properties. Whether you minimize the SSE or LAD or some other objective function, I think they are just different methods to get a consistent estimator. Following this logic, I thought the reason people use least squares must be that 1) it produces a consistent estimator of the model, or 2) something else that I don't know.
In econometrics, we know that in the linear regression model, if you assume the error terms have zero mean conditional on the predictors, are homoscedastic, and are uncorrelated with each other, then minimizing the sum of squared errors will give you a CONSISTENT estimator of your model parameters and, by the Gauss-Markov theorem, this estimator is BLUE. So this would suggest that if you choose to minimize some other objective function that is not the SSE, then there is no guarantee that you will get a consistent estimator of your model parameters. Is my understanding correct? If it is correct, then minimizing SSE rather than some other objective function can be justified by consistency, which is acceptable and, in fact, better than saying the quadratic function is nicer.
In practice, I have actually seen many cases where people directly minimize the sum of squared errors without first clearly specifying the complete model, e.g., the distributional assumptions (moment assumptions) on the error term. It then seems to me that the user of this method just wants to see how closely the data fit the 'model' (I use quotation marks since the model assumptions are probably incomplete) in terms of the squared distance function.
A related question (also related to this website) is: why, when we try to compare different models using cross-validation, do we again use the SSE as the judgment criterion? i.e., choose the model that has the least SSE? Why not another criterion?
Best Answer
While your question is similar to a number of other questions on site, aspects of this question (such as your emphasis on consistency) make me think they're not sufficiently close to being duplicates.
Why not, indeed? If your objective is different from least squares, you should address your objective instead!
Nevertheless, least squares has a number of nice properties (not least, an intimate connection to estimating means, which many people want, and a simplicity which makes it an obvious first choice when teaching or trying to implement new ideas).
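The connection to means is easiest to see in the simplest case, fitting a single constant $c$ to observations $y_1, \dots, y_n$ (notation introduced here purely for illustration):

$$
\frac{d}{dc} \sum_{i=1}^{n} (y_i - c)^2 = -2 \sum_{i=1}^{n} (y_i - c) = 0
\quad\Longrightarrow\quad
\hat{c} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}.
$$

So the least squares fit of a constant is the sample mean. The same argument applied to $\sum_{i} |y_i - c|$ leads to a sample median instead, which is one way to see that the choice of criterion determines which feature of the distribution you end up estimating.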
Further, in many cases people don't have a clear objective function, so there's an advantage to choosing what's readily available and widely understood.
That said, least squares also has some less-nice properties (sensitivity to outliers, for example) -- so sometimes people prefer a more robust criterion.
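As a small illustration of that sensitivity, here is a sketch for the simplest possible "model", a single location parameter, where least squares gives the sample mean and least absolute deviations gives the sample median. The data values and the helper name `location_estimates` are made up for the example.

```python
import numpy as np

def location_estimates(y):
    """Return (least-squares fit, least-absolute-deviations fit) of a constant.

    For a constant-only model these are the sample mean and sample median.
    """
    y = np.asarray(y, dtype=float)
    return float(np.mean(y)), float(np.median(y))

# Hypothetical measurements clustered around 10.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
# Same data with the last value replaced by one gross outlier.
contaminated = clean[:-1] + [1000.0]

clean_fit = location_estimates(clean)
contaminated_fit = location_estimates(contaminated)

print("clean        (mean, median):", clean_fit)
print("contaminated (mean, median):", contaminated_fit)
```

A single bad point drags the least squares estimate far from the bulk of the data, while the median barely moves.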
Least squares is not a requirement for consistency. Consistency isn't a very high hurdle -- plenty of estimators will be consistent. Almost all estimators people use in practice are consistent.
As for BLUE: that makes least squares the best among linear unbiased estimators. But in situations where all linear estimators are bad (as would be the case under extreme heavy tails, say), there's not much advantage in having the best one.
It's not hard to find consistent estimators, so no, that's not an especially good justification of least squares.
If your objective is better reflected by something else, why not indeed?
There is no lack of people using objective functions other than least squares. They come up in M-estimation, in least-trimmed estimators, in quantile regression, and when people use LINEX loss functions, just to name a few.
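To give a feel for one of these, here is a minimal sketch of the check ("pinball") loss that underlies quantile regression, again for a constant-only fit. The simulated gamma data, the brute-force grid search, and the names `pinball_loss` and `argmin_over_grid` are all assumptions chosen to keep the example short; real quantile regression software uses proper optimizers.

```python
import numpy as np

def pinball_loss(c, y, tau):
    """Check loss for quantile level tau at candidate constant c."""
    r = y - c
    # Positive residuals are weighted by tau, negative ones by (1 - tau).
    return np.sum(np.where(r >= 0, tau * r, (tau - 1) * r))

def argmin_over_grid(loss, grid):
    """Brute-force minimizer: evaluate the loss at each grid point."""
    values = [loss(c) for c in grid]
    return grid[int(np.argmin(values))]

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.0, size=2001)  # skewed toy data
grid = np.linspace(0.0, 10.0, 10001)            # candidate constants

# tau = 0.5 is LAD up to a constant factor; tau = 0.9 targets the 90% point.
lad_fit = argmin_over_grid(lambda c: pinball_loss(c, y, 0.5), grid)
q90_fit = argmin_over_grid(lambda c: pinball_loss(c, y, 0.9), grid)

print("tau=0.5 fit:", lad_fit, " sample median:", np.median(y))
print("tau=0.9 fit:", q90_fit, " sample 0.9 quantile:", np.quantile(y, 0.9))
```

Changing `tau` changes which quantile the minimizer tracks, so the loss is chosen to match the thing you actually want to estimate.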
Presumably the parameters of the functional assumptions are what you're trying to estimate - in which case, the functional assumptions are what you do least squares (or whatever else) around; they don't determine the criterion, they're what the criterion is estimating.
On the other hand, if you have a distributional assumption, then you have a lot of information about a more suitable objective function -- presumably, for example, you'll want to get efficient estimates of your parameters -- which in large samples will tend to lead you toward MLE, (though possibly in some cases embedded in a robustified framework).
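Two standard cases show how a distributional assumption picks out the criterion. The notation here is introduced just for illustration: $f(x_i; \theta)$ is the model for the response, and the errors $y_i - f(x_i; \theta)$ are assumed i.i.d. With Gaussian errors of variance $\sigma^2$, the log-likelihood is

$$
\ell(\theta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f(x_i; \theta)\bigr)^2 ,
$$

so maximizing over $\theta$ is the same as minimizing the sum of squared errors. With Laplace (double exponential) errors of scale $b$, it is

$$
\ell(\theta, b) = -n \log(2b) - \frac{1}{b} \sum_{i=1}^{n} \bigl|y_i - f(x_i; \theta)\bigr| ,
$$

so maximizing over $\theta$ is the same as minimizing the sum of absolute deviations, i.e. LAD.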
LAD is a quantile estimator. Under the conditions in which you'd expect it to work, it's a consistent estimator of the parameter it is meant to estimate (a conditional median), in the same way that least squares is for the conditional mean. (If you look at what you show consistency for with least squares, there are corresponding results for many other common estimators. People rarely use inconsistent estimators, so if you see an estimator being widely discussed, unless they're talking about its inconsistency, it's almost certainly consistent.*)
* That said, consistency isn't necessarily an essential property. After all, for my sample, I have some particular sample size, not a sequence of sample sizes tending to infinity. What matters are the properties at the $n$ I have, not some infinitely larger $n$ that I don't have and will never see. But much more care is required when we have inconsistency - we may have a good estimator at $n=20$, but it may be terrible at $n=2000$; there's more effort required, in some sense, if we want to use inconsistent estimators as a matter of course.
If you use LAD to estimate the mean of an exponential, it won't be consistent for that (though a trivial scaling of its estimate would be) -- but by the same token if you use least squares to estimate the median of an exponential, it won't be consistent for that (and again, a trivial rescaling fixes that).
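A quick simulation sketch of the exponential case follows. It relies on the fact that an exponential distribution with mean $\mu$ has median $\mu \ln 2$; the particular mean, seed, sample sizes and the helper name `exponential_estimates` are arbitrary choices for the example.

```python
import numpy as np

def exponential_estimates(n, true_mean, rng):
    """Simulate n exponential draws and return four estimates.

    The sample mean is the least squares fit of a constant and the
    sample median is the LAD fit of a constant.
    """
    y = rng.exponential(scale=true_mean, size=n)
    ls_fit = float(np.mean(y))
    lad_fit = float(np.median(y))
    return {
        "ls_for_mean": ls_fit,                      # consistent for the mean
        "lad_for_mean": lad_fit,                    # NOT consistent for the mean
        "lad_rescaled_for_mean": lad_fit / np.log(2),
        "ls_rescaled_for_median": ls_fit * np.log(2),
    }

rng = np.random.default_rng(42)
true_mean = 2.0
true_median = true_mean * np.log(2)

for n in (20, 2000, 200000):
    print(n, exponential_estimates(n, true_mean, rng))

# One large sample kept for inspection.
large = exponential_estimates(200000, true_mean, rng)
print("true mean:", true_mean, " true median:", true_median)
```

As `n` grows, the raw LAD fit settles near the median rather than the mean, while the rescaled versions settle on the targets they were rescaled for.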