While your question is similar to a number of other questions on this site, aspects of this question (such as your emphasis on consistency) make me think they're not sufficiently close to being duplicates.
Why not choose some other objective function to minimize?
Why not, indeed? If your objective is different from least squares, you should address that objective instead!
Nevertheless, least squares has a number of nice properties (not least, an intimate connection to estimating means, which many people want, and a simplicity which makes it an obvious first choice when teaching or trying to implement new ideas).
Further, in many cases people don't have a clear objective function, so there's an advantage to choosing what's readily available and widely understood.
That said, least squares also has some less-nice properties (sensitivity to outliers, for example) -- so sometimes people prefer a more robust criterion.
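To make the robustness point concrete, here's a minimal sketch (the simulated data, variable names, and Huber tuning constant below are just illustrative choices, not anything from the discussion above) comparing the least-squares location estimate with a Huber-type robust estimate when one gross outlier is present:

```python
# Illustrative sketch: least squares vs. a robust Huber criterion for a
# location parameter, with a single gross outlier in the data.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=1.0, size=50)
y[0] = 100.0  # one gross outlier

# Least squares for a location parameter is just the sample mean.
ls_est = y.mean()

# Huber loss: quadratic near zero, linear in the tails (delta is a tuning constant).
def huber_objective(mu, delta=1.345):
    r = y - mu
    quad = 0.5 * r**2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin).sum()

huber_est = minimize_scalar(huber_objective, bounds=(y.min(), y.max()),
                            method="bounded").x

print(f"least squares (mean): {ls_est:.2f}")    # dragged toward the outlier
print(f"Huber estimate:       {huber_est:.2f}") # stays near 5
```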
minimizing the sum of squared errors will give you a CONSISTENT estimator of your model parameters
Least squares is not a requirement for consistency. Consistency isn't a very high hurdle -- plenty of estimators will be consistent. Almost all estimators people use in practice are consistent.
and by Gauss-Markov theorem, this estimator is BLUE.
But in situations where all linear estimators are bad (as would be the case under extreme heavy-tails, say), there's not much advantage in the best one.
if you choose to minimize some other objective function that is not the SSE, then there is no guarantee that you will get consistent estimator of your model parameter. Is my understanding correct?
It's not hard to find consistent estimators, so no, that's not an especially good justification of least squares.
why when we try to compare different models using cross validation, we again, use the SSE as the judgment criterion? [...] Why not other criterion?
If your objective is better reflected by something else, why not indeed?
There is no lack of people using other objective functions than least squares. It comes up in M-estimation, in least-trimmed estimators, in quantile regression, and when people use LINEX loss functions, just to name a few.
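For instance, quantile regression replaces the squared-error criterion with the "pinball" (check) loss. The sketch below (simulated data, variable names, and the use of a generic optimizer are purely illustrative) fits a straight line that way; $\tau = 0.5$ gives LAD, i.e. the conditional median:

```python
# Illustrative sketch: quantile regression by minimizing the pinball (check)
# loss instead of squared error.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.standard_t(df=3, size=200)  # heavy-tailed noise

def pinball_loss(beta, tau=0.5):
    """Check loss for the tau-th conditional quantile of y given x."""
    resid = y - (beta[0] + beta[1] * x)
    return np.where(resid >= 0, tau * resid, (tau - 1) * resid).sum()

# tau = 0.5 gives least absolute deviations (the conditional median);
# other tau values target other conditional quantiles.
fit = minimize(pinball_loss, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print("intercept, slope:", fit.x)
```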
I was thinking that when you have a dataset, you first set up your model, i.e. make a set of functional or distributional assumptions. In your model, there are some parameters (assume it is a parametric model),
Presumably the parameters of the functional assumptions are what you're trying to estimate - in which case, the functional assumptions are what you do least squares (or whatever else) around; they don't determine the criterion, they're what the criterion is estimating.
On the other hand, if you have a distributional assumption, then you have a lot of information about a more suitable objective function -- presumably, for example, you'll want to get efficient estimates of your parameters -- which in large samples will tend to lead you toward MLE (though possibly, in some cases, embedded in a robustified framework).
then you need to find a way to consistently estimate these parameters. Whether you minimize the SSE or LAD or some other objective function,
LAD is a quantile estimator. It's a consistent estimator of the parameter it's meant to estimate (the conditional median), under the conditions where you'd expect it to be, in the same way that least squares is. (If you look at what you show consistency for with least squares, there are corresponding results for many other common estimators. People rarely use inconsistent estimators, so if you see an estimator being widely discussed, unless they're talking about its inconsistency, it's almost certainly consistent.*)
* That said, consistency isn't necessarily an essential property. After all, for my sample, I have some particular sample size, not a sequence of sample sizes tending to infinity. What matters are the properties at the $n$ I have, not some infinitely larger $n$ that I don't have and will never see. But much more care is required when we have inconsistency -- we may have a good estimator at $n=20$, but it may be terrible at $n=2000$; there's more effort required, in some sense, if we want to use inconsistent estimators as a matter of course.
If you use LAD to estimate the mean of an exponential, it won't be consistent for that (though a trivial scaling of its estimate would be) -- but by the same token if you use least squares to estimate the median of an exponential, it won't be consistent for that (and again, a trivial rescaling fixes that).
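A quick simulation of that last point (the rate $\lambda = 2$ and the sample size here are arbitrary illustrative choices): for exponential data the mean is $1/\lambda$ and the median is $\ln(2)/\lambda$, so the LAD estimate (the sample median) needs a $\ln(2)$ rescaling to target the mean, and vice versa:

```python
# Illustrative simulation: LAD vs. least squares location estimates for
# exponential data, and the trivial rescaling that fixes the mismatch.
import numpy as np

rng = np.random.default_rng(2)
lam = 2.0  # rate; true mean = 0.5, true median = ln(2)/2 ~ 0.347
x = rng.exponential(scale=1 / lam, size=1_000_000)

ls_estimate = x.mean()       # least squares location estimate = sample mean
lad_estimate = np.median(x)  # LAD location estimate = sample median

print("LS  estimate:", ls_estimate)               # ~ 0.500 (the mean)
print("LAD estimate:", lad_estimate)              # ~ 0.347 (the median, not the mean)
print("LAD rescaled:", lad_estimate / np.log(2))  # ~ 0.500 after the trivial rescaling
```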
Best Answer
The general term for what you are asking about is model selection. You have a set of possible models, in this case something like $$ \begin{aligned} y&=\beta_1x + \beta_0\\ y&=\beta_2x^2 + \beta_1x + \beta_0 \\ y&=\beta_3x^3 + \beta_2x^2 + \beta_1x + \beta_0 \\ \end{aligned}$$ and you want to determine which of these models your data best supports. We generally worry about parsimony rather than best fit (i.e., highest $R^2$), since a complex model could "over-fit" the data.

For example, imagine your timing data is generated by a quadratic algorithm, but there's a little bit of noise in the timing (random paging by the OS, clock inaccuracy, cosmic rays, whatever). The quadratic model might still fit reasonably well, but it won't be perfect. However, we can find a (very high order) polynomial that goes through each and every data point. This model fits perfectly but will be terrible at making future predictions and, obviously, doesn't match the underlying phenomenon either. We want to balance model complexity with the model's explanatory power. How does one do this?
There are many options. I recently stumbled upon this review by Zucchini, which might be a good overview. One approach is to calculate something like the AIC (Akaike Information Criterion), which adjusts each model's likelihood to take the number of parameters into account. These are often relatively easy to compute. For example, AIC is: $$ \mathrm{AIC} = 2k - 2\ln(L) $$ where $L$ is the likelihood of the data given the model and $k$ is the number of parameters (e.g., 2 for linear, 3 for quadratic, etc.). You compute this criterion for each model, then choose the model with the smallest AIC.
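Here's a sketch of how that comparison might look for the polynomial models above, assuming Gaussian errors so that $-2\ln(L)$ reduces to $n\ln(\mathrm{RSS}/n)$ plus a constant shared by every model (the timing data below is simulated purely for illustration):

```python
# Illustrative AIC comparison of linear, quadratic, and cubic fits to
# simulated runtime data (true generating process is quadratic plus noise).
import numpy as np

rng = np.random.default_rng(3)
n_sizes = np.arange(10, 210, 10)  # input sizes
runtime = 0.002 * n_sizes**2 + 0.5 * n_sizes + rng.normal(0, 5, n_sizes.size)

def aic_for_degree(degree):
    coeffs = np.polyfit(n_sizes, runtime, degree)
    resid = runtime - np.polyval(coeffs, n_sizes)
    rss = np.sum(resid**2)
    n = n_sizes.size
    # k counts the fitted polynomial coefficients; including the error variance
    # would add the same constant to every model, so it doesn't change the ranking.
    k = degree + 1
    return n * np.log(rss / n) + 2 * k  # AIC up to a model-independent constant

for d, label in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    print(f"{label:9s} AIC = {aic_for_degree(d):.1f}")
# The quadratic model should come out with the smallest AIC here.
```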
Another approach is to use cross-validation (or something like it) to check how well each model predicts data it wasn't fit to; a model that is over-fit will do noticeably worse out of sample. You could then select the model that predicts best.
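And a corresponding sketch of the cross-validation route (again on simulated timing data; the fold count and the degrees tried are arbitrary choices): estimate each degree's out-of-sample prediction error and pick the one that predicts held-out points best.

```python
# Illustrative k-fold cross-validation over polynomial degree for simulated
# runtime data; heavily over-parameterised fits tend to predict held-out
# points worse than the quadratic.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)
n_sizes = np.arange(10, 210, 10).astype(float)
runtime = 0.002 * n_sizes**2 + 0.5 * n_sizes + rng.normal(0, 5, n_sizes.size)

def cv_mse(x, y, degree, n_folds=5):
    """Mean squared prediction error over n_folds random folds."""
    idx = np.random.default_rng(0).permutation(x.size)
    folds = np.array_split(idx, n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        fit = Polynomial.fit(x[train], y[train], degree)  # rescales x internally
        errors.append(np.mean((y[fold] - fit(x[fold])) ** 2))
    return np.mean(errors)

for degree, label in [(1, "linear"), (2, "quadratic"), (3, "cubic"), (8, "degree 8")]:
    print(f"{label:9s}  CV MSE = {cv_mse(n_sizes, runtime, degree):g}")
```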
That's sort of the general case. However, as @Michelle noted above, you probably don't want to be doing model selection at all if you know something about the underlying phenomenon. In this case, if you have the code or know the underlying algorithm, you should just trace through it to determine the algorithm's order.
Also, keep in mind that the Big-O order of the algorithm isn't technically defined in terms of the best fit to the observed run time; it's more of a limiting property. You could feasibly have an algorithm with a massive linear component and a small quadratic component to its runtime, something like $$t(n) = 0.0000001\,n^2 + 999999999\,n$$ I would bet that a runtime-vs-input-size plot for that would be pretty linear-looking over the ranges you're likely to test, but I believe the algorithm would technically be considered $O(n^2)$.
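A quick numeric check of that claim (just evaluating the two terms of that hypothetical runtime): the quadratic term doesn't overtake the linear one until around $n \approx 10^{16}$, far beyond any input size you'd realistically benchmark.

```python
# Illustrative check: share of the total runtime contributed by the quadratic
# term of t(n) = 0.0000001*n**2 + 999999999*n at various input sizes.
def t(n):
    return 0.0000001 * n**2 + 999999999 * n

for n in (10**3, 10**6, 10**9, 10**16, 10**18):
    quad = 0.0000001 * n**2
    print(f"n = {n:.0e}:  quadratic term's share of t(n) = {quad / t(n):.2%}")
# Essentially 0% until n is around 1e16, so empirical plots look linear.
```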