I was reading about maximum likelihood estimation from various sources on the internet and I noticed that MLE makes an assumption about the data known as IID, but I don't completely understand why it is necessary to make this assumption. Are there any other assumptions that MLE makes?
Solved – Independent and Identically distributed assumption in Maximum likelihood estimation
data visualization, machine learning, mathematical-statistics, maximum likelihood, statistical significance
Related Solutions
Unbiasedness isn't necessarily especially important on its own.
Aside from a very limited set of circumstances, most useful estimators are biased, however they're obtained.
If two estimators have the same variance, one can readily mount an argument for preferring an unbiased one to a biased one, but that's an unusual situation to be in (that is, you may reasonably prefer unbiasedness, ceteris paribus -- but those pesky ceteris are almost never paribus).
More typically, if you want unbiasedness you'll be adding some variance to get it, and then the question would be why would you do that?
Bias is how far the expected value of my estimator sits above the true value (with negative bias indicating it sits below).
When I'm considering a small-sample estimator, I don't really care much about that. I'm usually more interested in how far wrong my estimator will be in this particular instance - my typical distance from the right answer... something like a root-mean-square error or a mean absolute error would make more sense.
So if you like low variance and low bias, asking for say a minimum mean square error estimator would make sense; these are very rarely unbiased.
Bias is a useful notion to be aware of, but unbiasedness is not an especially useful property to seek unless you're only comparing estimators with the same variance.
ML estimators tend to be low-variance; they're usually not minimum MSE, but they often have lower MSE than the estimators you get by modifying them to be unbiased (when you can do that at all) would give you.
As an example, consider estimating the variance when sampling from a normal distribution. Writing $S^2 = \sum_i (x_i - \bar{x})^2$, the three estimators are $\hat{\sigma}^2_\text{MMSE} = \frac{S^2}{n+1}$, $\hat{\sigma}^2_\text{MLE} = \frac{S^2}{n}$, and $\hat{\sigma}^2_\text{Unb} = \frac{S^2}{n-1}$ (indeed, the MMSE estimator of the variance always has a larger denominator than $n-1$).
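A quick way to see this trade-off is to simulate it. Below is a minimal sketch (not part of the original answer) that draws repeated normal samples and compares the bias and MSE of the three estimators; the sample size, true variance, and number of replications are arbitrary choices for illustration.

```python
# Minimal simulation sketch: bias and MSE of the three variance estimators
# S^2/(n+1), S^2/n, S^2/(n-1) for iid normal samples. The settings below
# (n = 10, true sigma^2 = 4, 200k replications) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, true_var, n_reps = 10, 4.0, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_reps, n))
# S^2 = sum of squared deviations from the sample mean, one value per replication
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for name, denom in [("MMSE (n+1)", n + 1), ("MLE (n)", n), ("Unbiased (n-1)", n - 1)]:
    est = ss / denom
    bias = est.mean() - true_var
    mse = ((est - true_var) ** 2).mean()
    print(f"{name:>15}: bias = {bias:+.4f}, MSE = {mse:.4f}")
```

With settings like these you should see the unbiased estimator come out with essentially zero bias but the largest MSE, while the $n+1$ denominator gives the smallest MSE, which is exactly the point above.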
$f(x_i, \theta)$ may not be a probability; it is a density function. In statistics generally, we don't want to have to make special exceptions for continuous versus discrete random variables all the time, especially since there is a field of mathematics (measure theory) that gives us a unified approach while still letting us be rigorous about such things.
The rationale for maximizing the product of the densities of a sample (the likelihood) is much like the rationale for an integral in calculus. Take height: it is a continuous quantity. Suppose I believe heights in a population follow a normal (maximum-entropy Gaussian) distribution, parametrized by a mean and a standard deviation. My height is measured with error, and even if I knew it to an atomic level I could never find a positive probability associated with that single value. The probability that my height is between 5'10" and 5'11" is small, between 5'10.25" and 5'10.75" it is smaller still, and if I squeeze and squeeze this range down into an $\epsilon$-ball, the associated probability goes to 0, even if my height happens to be the mean, mode, and median of the population.

So how is it that a value which is highly characteristic of the population shows such a small probability? A zen answer might be: the infinitesimal differences make up the whole. By looking at the density, the differential of probability, you find that an observation at the mean, mode, and median is in fact very characteristic: it achieves a higher density, and hence a higher likelihood, than any other value.
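As a concrete sketch of that shrinking-interval argument (the mean and standard deviation below are made-up height figures, not from the original answer), one can compute the probability of an $\epsilon$-ball around the mean and watch it go to 0 while the ratio of probability to interval width approaches the density at the mean:

```python
# Probability of an epsilon-ball around the mean vanishes as eps -> 0,
# but probability / width converges to the density at the mean.
# mu = 70 in, sigma = 3 in are hypothetical numbers for illustration.
from scipy.stats import norm

mu, sigma = 70.0, 3.0
dist = norm(loc=mu, scale=sigma)

for eps in [0.5, 0.25, 0.05, 0.005]:
    p = dist.cdf(mu + eps) - dist.cdf(mu - eps)   # P(mu - eps < X < mu + eps)
    print(f"eps = {eps:6.3f}:  P = {p:.5f},  P / (2*eps) = {p / (2 * eps):.5f}")

print(f"density at the mean: f(mu) = {dist.pdf(mu):.5f}")
```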
Best Answer
Assuming independence is not necessary for maximum likelihood (ML) estimation (or other likelihood-based methods). But if independence is a reasonable assumption, it makes ML easy to implement, since the log likelihood is then simply the sum of the individual log likelihoods. There are lots of examples where ML is used without independence: time series with ARMA (or ARIMA) models, spatial models with spatial dependence, mixed models where observations are correlated within groups, and others.
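To make the "log likelihood is just a sum under independence" point concrete, here is a minimal sketch (the normal model and the simulated data are assumptions for illustration, not part of the original answer) that maximizes an iid normal log likelihood numerically and checks it against the closed-form MLEs:

```python
# Under independence, the log likelihood is the sum of individual log densities,
# so it can be maximized directly. Normal model and simulated data are assumed
# purely for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # fake iid data

def neg_log_lik(params):
    mu, log_sigma = params                     # optimize log(sigma) so sigma stays positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)           # numerical MLEs
print(x.mean(), x.std(ddof=0))     # closed-form MLEs: sample mean, sqrt(S^2/n)
```

Under dependence (an ARMA model, say) the joint density no longer factorizes into a product, so the likelihood has to be built from the joint distribution instead, but the idea of maximizing it is the same.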
When it is reasonable to assume independence has already been discussed multiple times on this site; see for example Are "random sample" and "iid random variable" synonyms? or Independence of events in real-life data.