I found that for a simple linear regression model, both OLS and the maximum likelihood method (assuming normally distributed errors) give the same output (parameter values). From this, can we say that OLS also makes an implicit assumption about the normal distribution, or vice versa? I am not interested in why both produce the same values, but in which one makes less stringent assumptions about the data.
OLS vs Maximum Likelihood – Comparison in Linear Regression under Normal Distribution
least-squares, maximum-likelihood, normal-distribution, regression
Related Solutions
There seem to be multiple questions here. To use maximum likelihood estimation (MLE) you need to be able to write down the likelihood function, that is, the joint density (or joint probability mass function) of the observations. There is no need for the observations to be identically distributed; you can see that from the use of likelihood methods in regression.
But, in practice, likelihood methods are most useful when the distribution of the data can be written using a relatively small number of parameters which are common to all the observations. So a separate, unrestricted variance for each observation cannot be expected to work: that would lead to more parameters than observations. But there is still no need to assume the same variance for all the observations. What you need is some way to describe how the variance varies, maybe as a function of the mean, maybe as a function of some known covariate.
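For concreteness, one way to let the variance depend on a known covariate $z_i$ (the log-linear form here is just an illustrative assumption, not the only possibility) is

$$ y_i \sim N\!\left(x_i^\top \beta,\ \sigma_i^2\right), \qquad \log \sigma_i^2 = z_i^\top \gamma, $$

which gives the log-likelihood $\ell(\beta,\gamma) = -\tfrac12 \sum_i \left[\log(2\pi\sigma_i^2) + (y_i - x_i^\top\beta)^2/\sigma_i^2\right]$ to maximize over $(\beta,\gamma)$; every observation has its own variance, but only a handful of parameters are involved.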
You could even, in principle, assume normal errors for some of the observations in a regression model and Laplace errors for others; there is no obstacle in principle. But it is difficult to think of a situation where that would be a natural way to model!
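To spell that out, purely as a sketch: if the observations with indices in $N$ get normal errors with standard deviation $\sigma$ and those in $L$ get Laplace errors with scale $b$, the joint log-likelihood is simply the sum of the two sets of contributions,

$$ \ell(\beta,\sigma,b)=\sum_{i\in N}\log\!\left[\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(y_i-x_i^\top\beta)^2}{2\sigma^2}}\right]+\sum_{i\in L}\log\!\left[\frac{1}{2b}\,e^{-\frac{|y_i-x_i^\top\beta|}{b}}\right], $$

and it can be maximized numerically like any other likelihood.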
If you are using R, there are some packages that allow for separate modeling of the expectation and the variance, among them dglm and gamlss. See for instance Simulate linear regression with heteroscedasticity and Is it possible to calculate variable confidence intervals, conditional on $\hat{Y}$ to address heteroscedasticity?
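If you would rather see the machinery spelled out than lean on a package, here is a minimal sketch that maximizes the heteroscedastic normal likelihood written above by direct numerical optimization. The simulated data, the log-linear variance form, and the use of Python/SciPy rather than the R packages mentioned are all just illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated data where the variance grows with the covariate (illustrative assumption)
n = 200
x = rng.uniform(0, 2, n)           # mean-model covariate
z = x                              # variance-model covariate (here the same variable)
sigma = np.exp(0.5 * (-1.0 + 1.5 * z))
y = 1.0 + 2.0 * x + rng.normal(0, sigma)

def negloglik(theta):
    b0, b1, g0, g1 = theta
    mu = b0 + b1 * x                   # mean model
    log_var = g0 + g1 * z              # log-linear variance model
    resid2 = (y - mu) ** 2
    return 0.5 * np.sum(np.log(2 * np.pi) + log_var + resid2 / np.exp(log_var))

fit = minimize(negloglik, x0=np.zeros(4), method="BFGS")
print("beta_hat :", fit.x[:2])   # mean-model coefficients
print("gamma_hat:", fit.x[2:])   # variance-model coefficients
```

Only four parameters are estimated, even though every observation has its own variance.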
As you move sufficiently far away from normality, all linear estimators may be arbitrarily bad.
Knowing that you can get the best of a bad lot (i.e. the best linear unbiased estimate) isn't much consolation.
If you can specify a suitable distributional model (ay, there's the rub), maximizing the likelihood has both a direct intuitive appeal, in that it "maximizes the chance" of seeing the sample you actually saw (with a suitable refinement of what we mean by that in the continuous case), and a number of very neat properties that are both theoretically and practically useful (e.g. the relationship to the Cramér-Rao lower bound, equivariance under transformation, the relationship to likelihood ratio tests, and so forth). This motivates M-estimation, for example.
Even when you can't specify a model, it is possible to construct a model for which ML is robust to contamination by gross errors in the conditional distribution of the response, one which retains pretty good efficiency at the Gaussian but avoids the potentially disastrous impact of arbitrarily large outliers.
[That's not the only consideration with regression, since there's also a need for robustness to the effect of influential outliers, for example, but it's a good initial step.]
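To make that idea concrete with a quick sketch (the Huber M-estimator and the simulated contamination below are my own illustrative choices, not something prescribed above), you can compare OLS with a robust fit on data containing a few gross errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Mostly Gaussian errors, contaminated by a few gross outliers (illustrative)
n = 100
x = rng.uniform(0, 10, n)
err = rng.normal(0, 1, n)
err[:5] += 50                      # five wildly wrong responses
y = 1.0 + 0.5 * x + err

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
huber_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS slope   :", ols_fit.params[1])    # dragged around by the outliers
print("Huber slope :", huber_fit.params[1])  # stays close to the true 0.5
```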
As a demonstration of the problem with even the best linear estimator, consider this comparison of slope estimators for regression. In this case there are 100 observations in each sample, x is 0/1, the true slope is $\frac12$ and errors are standard Cauchy. The simulation takes 1000 sets of simulated data and computes the least squares estimate of slope ("LS") as well as a couple of nonlinear estimators that could be used in this situation (neither is fully efficient at the Cauchy but they're both reasonable) - one is an L1 estimator of the line ("L1") and the second computes a simple L-estimate of location at the two values of x and fits a line joining them ("LE").
The top part of the diagram is a boxplot of those thousand slope estimates for each estimator. The lower part is the central one percent of that range (roughly; it is marked with a faint orange-grey box in the top plot) "blown up" so we can see more detail. As we can see, the least squares slopes range from -771 to 1224, and the lower and upper quartiles are -1.24 and 2.46. The error in the LS slope exceeded 10 in magnitude more than 10% of the time. The two nonlinear estimators do much better: they perform fairly similarly to each other; none of the 1000 slope estimates in either case is more than 0.84 from the true slope, and the median absolute error in the slope is in the ballpark of 0.14 for each (vs 1.86 for the least squares estimator). The RMSE of the LS slope is 223 and 232 times that of the L1 and LE estimators, respectively, in this case (not an especially meaningful quantity, however, as the LS estimator doesn't have a finite variance when the errors are Cauchy).
There are dozens of other reasonable estimators that might have been used here; this was simply a quick calculation to illustrate that even the best/most efficient linear estimators may not be useful. An ML estimator of the slope would perform better (in the MSE sense) than the two robust estimators used here, but in practice you'd want something with some robustness to influential points.
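If you want to reproduce the flavor of this comparison, a rough sketch follows; the particular L1 solver (median regression) and the group-median version of the L-estimate are stand-ins I chose for illustration, not necessarily the exact estimators used above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, n_sim, true_slope = 100, 1000, 0.5

ls_slopes, l1_slopes, le_slopes = [], [], []
for _ in range(n_sim):
    x = np.repeat([0.0, 1.0], n // 2)                # x is 0/1
    y = true_slope * x + rng.standard_cauchy(n)      # standard Cauchy errors
    X = sm.add_constant(x)

    ls_slopes.append(sm.OLS(y, X).fit().params[1])              # least squares ("LS")
    l1_slopes.append(sm.QuantReg(y, X).fit(q=0.5).params[1])    # L1 / median regression ("L1")
    le_slopes.append(np.median(y[x == 1]) - np.median(y[x == 0]))  # simple L-estimate ("LE")

for name, s in [("LS", ls_slopes), ("L1", l1_slopes), ("LE", le_slopes)]:
    err = np.abs(np.array(s) - true_slope)
    print(f"{name}: median |error| = {np.median(err):.3f}, max |error| = {err.max():.1f}")
```

The exact numbers will differ from those quoted above (different seed and estimators), but the qualitative picture, least squares occasionally producing wildly wrong slopes while the two nonlinear estimators stay near 0.5, should be the same.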
Best Answer
OLS does not make a normality assumption about the model errors. OLS can be used under various distributional assumptions, and under the Gauss-Markov conditions (zero-mean, homoscedastic, uncorrelated errors) the estimator still makes sense as the minimum-variance linear unbiased estimator.
Maximum likelihood (ML) can also accommodate different distributions, but the distribution has to be chosen in advance. If the actual distribution turns out to be different from the assumed one, the ML estimator no longer makes sense as the estimator that maximizes the joint probability density of the data.
Thus we can say that in a particular application ML makes a more stringent assumption about the model errors than OLS does.
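To tie this back to the observation in the question, here is a quick numerical check (the simulated data and the optimizer choice are just for illustration) that minimizing the sum of squared residuals and maximizing the Gaussian likelihood land on the same coefficients:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Simulated data with normal errors (illustrative)
n = 200
x = rng.uniform(0, 5, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, n)
X = np.column_stack([np.ones(n), x])

# OLS: closed-form least squares solution
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# ML under a normal error assumption: maximize the Gaussian log-likelihood
def negloglik(theta):
    beta, log_sigma = theta[:2], theta[2]
    resid = y - X @ beta
    return 0.5 * np.sum(np.log(2 * np.pi) + 2 * log_sigma + resid**2 / np.exp(2 * log_sigma))

beta_ml = minimize(negloglik, x0=np.zeros(3), method="BFGS").x[:2]

print("OLS coefficients:", beta_ols)   # the two agree to numerical precision
print("ML  coefficients:", beta_ml)
```

The agreement holds because, with normal errors, the Gaussian log-likelihood is a decreasing function of the sum of squared residuals; the extra assumption ML carries only matters when the chosen error distribution is not normal.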