The Gauss-Markov theorem gives the conditions under which the OLS estimator is the BLUE, and those conditions do not include normality of the residuals. When we add the normality assumption, we can drop the "L" and wind up with the "Best Unbiased Estimator", not just the best linear unbiased estimator (section 2.1, example 1 of the Ohio State econometrics notes).
However, if we do not make the normality assumption, then we can wind up with nonlinear estimators of the coefficients that are unbiased yet have lower variance than the OLS estimator. For example, consider heavy-tailed errors and the solution given by minimizing absolute loss (quantile regression at the median), as I do here.
As you move sufficiently far away from normality, all linear estimators may be arbitrarily bad.
Knowing that you can get the best of a bad lot (i.e. the best linear unbiased estimate) isn't much consolation.
If you can specify a suitable distributional model (ay, there's the rub),
maximizing the likelihood has both a direct intuitive appeal - in that it "maximizes the chance" of seeing the sample you actually saw (with a suitable refinement of what we mean by that in the continuous case) - and a number of very neat properties that are both theoretically and practically useful (e.g. the relationship to the Cramér-Rao lower bound, equivariance under transformation, the relationship to likelihood ratio tests, and so forth). This motivates M-estimation, for example.
Even when you can't specify a model, it is possible to construct a model for which ML is robust to contamination by gross errors in the conditional distribution of the response -- where it retains pretty good efficiency at the Gaussian but avoids the potentially disastrous impact of arbitrarily large outliers.
[That's not the only consideration with regression, since there's also a need for robustness to the effect of influential outliers for example, but it's a good initial step]
As a demonstration of the problem with even the best linear estimator, consider this comparison of slope estimators for regression. In this case there are 100 observations in each sample, x is 0/1, the true slope is $\frac12$ and errors are standard Cauchy. The simulation takes 1000 sets of simulated data and computes the least squares estimate of slope ("LS") as well as a couple of nonlinear estimators that could be used in this situation (neither is fully efficient at the Cauchy but they're both reasonable) - one is an L1 estimator of the line ("L1") and the second computes a simple L-estimate of location at the two values of x and fits a line joining them ("LE").
The top part of the diagram is a boxplot of those thousand slope estimates for each estimator. The lower part is the central one percent of that plot (roughly; it is marked with a faint orange-grey box in the top plot), "blown up" so we can see more detail. As we see, the least squares slopes range from -771 to 1224 and the lower and upper quartiles are -1.24 and 2.46. The absolute error in the LS slope exceeded 10 more than 10% of the time. The two nonlinear estimators do much better -- they perform fairly similarly to each other, in neither case is any of the 1000 slope estimates more than 0.84 from the true slope, and the median absolute error in the slope is in the ballpark of 0.14 for each (vs 1.86 for the least squares estimator). The LS slope has an RMSE of 223 and 232 times that of the L1 and LE estimators respectively in this case (that's not an especially meaningful quantity, however, as the LS estimator doesn't have a finite variance when you have Cauchy errors).
There are dozens of other reasonable estimators that might have been used here; this was simply a quick calculation to illustrate that even the best/most efficient linear estimators may not be useful. An ML estimator of the slope would perform better (in the MSE sense) than the two robust estimators used here, but in practice you'd want something with some robustness to influential points.
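The following is not the original code, only a minimal sketch of this kind of comparison; the choice of the quantreg package for the L1 fit and of a 25% trimmed mean as the L-estimate of location at each value of x are assumptions made here for illustration.

```r
# Sketch: compare LS, L1 and a simple L-estimate ("LE") of the slope
# under standard Cauchy errors; x is 0/1 and the true slope is 0.5.
library(quantreg)   # for rq(), the L1 / median regression fit

set.seed(42)
nsim <- 1000; n <- 100; true_slope <- 0.5
x <- rep(0:1, each = n / 2)

slopes <- t(replicate(nsim, {
  y <- true_slope * x + rcauchy(n)             # standard Cauchy errors
  c(LS = coef(lm(y ~ x))[[2]],                 # least squares slope
    L1 = coef(rq(y ~ x, tau = 0.5))[[2]],      # L1 (median regression) slope
    LE = mean(y[x == 1], trim = 0.25) -        # slope of the line joining trimmed
         mean(y[x == 0], trim = 0.25))         # means at x = 0 and x = 1
}))

boxplot(slopes, ylab = "slope estimate")
abline(h = true_slope, lty = 2)
apply(abs(slopes - true_slope), 2, median)     # median absolute error per estimator
```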
Best Answer
One example that comes to mind is a GLS estimator that weights observations differently, although that is not necessary when the Gauss-Markov assumptions are met (which the statistician may not know to be the case, and hence may still apply GLS).
Consider the case of a regression of $y_i$, $i=1,\ldots,n$ on a constant for illustration (readily generalizes to general GLS estimators). Here, $\{y_i\}$ is assumed to be a random sample from a population with mean $\mu$ and variance $\sigma^2$.
Then, we know that OLS is just $\hat\beta=\bar y$, the sample mean. To emphasize the point that each observation is weighted with weight $1/n$, write this as $$ \hat\beta=\sum_{i=1}^n\frac{1}{n}y_i. $$ It is well-known that $Var(\hat\beta)=\sigma^2/n$.
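(For completeness, the variance follows directly from independence of the $y_i$: $V(\hat\beta)=\sum_{i=1}^n\frac{1}{n^2}V(y_i)=n\cdot\frac{\sigma^2}{n^2}=\frac{\sigma^2}{n}$.)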
Now, consider another estimator which can be written as $$ \tilde\beta=\sum_{i=1}^nw_iy_i, $$ where the weights are such that $\sum_iw_i=1$. This ensures that the estimator is unbiased, as $$ E\left(\sum_{i=1}^nw_iy_i\right)=\sum_{i=1}^nw_iE(y_i)=\sum_{i=1}^nw_i\mu=\mu. $$ Its variance will exceed that of OLS unless $w_i=1/n$ for all $i$ (in which case it will of course reduce to OLS), which can for instance be shown via a Lagrangian:
\begin{align*} L&=V(\tilde\beta)-\lambda\left(\sum_iw_i-1\right)\\ &=\sum_iw_i^2\sigma^2-\lambda\left(\sum_iw_i-1\right), \end{align*} with partial derivatives w.r.t. $w_i$ set to zero being equal to $2\sigma^2w_i-\lambda=0$ for all $i$, and $\partial L/\partial\lambda=0$ equaling $\sum_iw_i-1=0$. Solving the first set of derivatives for $\lambda$ and equating them yields $w_i=w_j$, which implies $w_i=1/n$ minimizes the variance, by the requirement that the weights sum to one.
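As a concrete check, consider the specific unequal weighting used in the simulation below, where (assuming $n$ is even) one half of the sample gets weight $(1+\epsilon)/n$ and the other half $(1-\epsilon)/n$, so the weights still sum to one. The variance then works out to $$ V(\tilde\beta)=\sigma^2\sum_{i=1}^nw_i^2=\sigma^2\left[\frac{n}{2}\left(\frac{1+\epsilon}{n}\right)^2+\frac{n}{2}\left(\frac{1-\epsilon}{n}\right)^2\right]=\frac{\sigma^2}{n}\left(1+\epsilon^2\right)>\frac{\sigma^2}{n}, $$ so the loss of efficiency relative to OLS grows with $\epsilon^2$.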
Here is a graphical illustration from a little simulation, created with the code below:
EDIT: In response to @kjetilbhalvorsen's and @RichardHardy's suggestions I also include the median of the $y_i$, the MLE of the location parameter of a t(4) distribution (for which I get warnings

    In log(s) : NaNs produced

that I did not check further), and Huber's estimator in the plot. We observe that all estimators seem to be unbiased. However, the estimator that uses weights $w_i=(1\pm\epsilon)/n$ for either half of the sample is more variable, as are the median, the MLE of the t-distribution and Huber's estimator (the latter only slightly so, see also here).
That the latter three are outperformed by the OLS solution is not immediately implied by the BLUE property (at least not to me), as it is not obvious whether they are linear estimators (nor do I know whether the MLE and Huber's estimator are unbiased).
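The original code is not reproduced above, so here is only a sketch of a comparable simulation. The package choices are assumptions: MASS::rlm (with its default Huber psi) for Huber's estimator and MASS::fitdistr for the t(4) location MLE (a fit that can also produce the log(s) warnings mentioned above); eps is an illustrative size for the weight perturbation.

```r
# Sketch: compare OLS (equal weights), an unequally weighted estimator,
# the median, the t(4) location MLE and Huber's estimator of a mean.
library(MASS)   # rlm() for Huber's estimator, fitdistr() for the t MLE

set.seed(1)
nsim <- 1000; n <- 50; mu <- 0; sigma <- 1
eps <- 0.5      # illustrative perturbation of the weights away from 1/n

estimates <- t(replicate(nsim, {
  y <- rnorm(n, mu, sigma)
  w <- rep(c(1 + eps, 1 - eps) / n, each = n / 2)   # weights (1±eps)/n, summing to 1
  c(OLS    = mean(y),                               # equal weights 1/n
    WLS    = sum(w * y),                            # unequally weighted estimator
    median = median(y),
    tMLE   = fitdistr(y, "t", df = 4)$estimate[["m"]],
    Huber  = coef(rlm(y ~ 1))[[1]])
}))

boxplot(estimates, ylab = "estimate of the mean")
abline(h = mu, lty = 2)
apply(estimates, 2, var)   # OLS should show the smallest variance
```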