[Math] Why does maximum likelihood estimation work the way that it does

probability, statistics

I'm wrapping my head around MLE right now and there's something about it that bothers me, irrationally I'm sure. I believe I understand the procedure: essentially we hold our observations fixed and maximize the likelihood function with respect to the parameters to find the parameters which would make a PDF that assigns a maximum value to our observations.

My question is this: why do we care about finding such a PDF? In particular, I'm imagining that we end up with a very skewed PDF so that the expected value is far from the maximum. Or what if we have an even weirder PDF than that? If $f$ is a PDF, it was my understanding that the number $f(x)$ is not particularly meaningful for continuous random variables—it's the area under the curve that we care about. So why aren't we in some way trying to maximize the area under our observations, or taking the expected value into account or something?

Hopefully this question makes a little bit of sense. I can try to clarify if it doesn't.

Best Answer

Suppose you are trying to maximize the likelihood of i.i.d. data $x_1, x_2, \ldots, x_n$ drawn from a distribution with p.d.f. $f(x|\theta)$ depending on a parameter vector $\theta$.

The joint density of the data given $\theta$ is $$ f(x_1, x_2, \ldots, x_n | \theta) = \prod_i f(x_i | \theta). $$

The goal is to find the $\theta$ which maximizes this joint density of $x_1, x_2, \ldots, x_n$. Remember, we don't know $\theta$ yet, but we can begin the process of estimating it by defining the likelihood function $$ l(\theta|x_1, x_2, \ldots, x_n) = \prod_i f(x_i | \theta). $$
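For a concrete illustration (the exponential model here is chosen purely as an example, not something specified in the question): if we assume $x_i \sim \text{Exponential}(\theta)$ with density $f(x|\theta) = \theta e^{-\theta x}$ for $x > 0$, the likelihood is $$ l(\theta|x_1, x_2, \ldots, x_n) = \prod_i \theta e^{-\theta x_i} = \theta^n e^{-\theta \sum_i x_i}. $$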

If you imagine the likelihood function as a unimodal curve in $\theta$, the point which maximizes the likelihood of the data is at the very peak of the hump. We can use calculus to find this point because an interior maximum of a smooth function satisfies two conditions: (1) its first derivative is zero and (2) its second derivative is negative, or, $$\frac{\partial}{\partial \theta} l(\theta|x_1, x_2, \ldots, x_n) = 0 $$ and $$ \frac{\partial^2}{\partial \theta^2} l(\theta|x_1, x_2, \ldots, x_n) < 0. $$
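In practice it is usually easier to work with the log-likelihood: since $\log$ is strictly increasing, the same $\theta$ maximizes both, and the product becomes a sum, $$ \log l(\theta|x_1, x_2, \ldots, x_n) = \sum_i \log f(x_i | \theta), $$ whose derivative is typically much simpler to set to zero.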

Solving the first-order condition for $\theta$ yields the functional form of an estimator that maximizes the joint likelihood of the data $x_1, x_2, \ldots, x_n$.
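Continuing the illustrative exponential example from above: $\log l(\theta) = n \log \theta - \theta \sum_i x_i$, so the first-order condition $$ \frac{\partial}{\partial \theta} \log l(\theta) = \frac{n}{\theta} - \sum_i x_i = 0 $$ gives $\hat{\theta} = n / \sum_i x_i = 1/\bar{x}$, and the second derivative $-n/\theta^2 < 0$ confirms this is indeed a maximum.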

It is good to note that these are sufficient conditions only for a local maximum. If the likelihood is not unimodal as we imagined, then we cannot guarantee our estimate is the maximum likelihood estimator. Generally, numerical methods can be used to explore the likelihood function for a global maximum when it cannot be identified analytically.
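As a minimal sketch of what such a numerical search can look like: assuming a Gamma model and simulated data (both chosen here purely for illustration), one can minimize the negative log-likelihood with a general-purpose optimizer.

```python
# A minimal sketch: maximize a likelihood numerically by minimizing the
# negative log-likelihood. The Gamma(shape, scale) model and the simulated
# data are illustrative assumptions, not anything implied by the answer above.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(0)
data = gamma.rvs(a=2.0, scale=1.5, size=500, random_state=rng)  # simulated sample

def neg_log_likelihood(params, x):
    shape, scale = params
    if shape <= 0 or scale <= 0:       # keep the search inside the parameter space
        return np.inf
    return -np.sum(gamma.logpdf(x, a=shape, scale=scale))

# Start from a rough guess and let a local optimizer climb the log-likelihood surface.
result = minimize(neg_log_likelihood, x0=[1.0, 1.0], args=(data,), method="Nelder-Mead")
print("MLE (shape, scale):", result.x)
```

For a model like this, with no closed-form estimator for the shape parameter, trying several starting points (or plotting the log-likelihood) is a simple way to gain confidence that the optimizer has found the global maximum rather than a local one.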
