Maximum likelihood estimation intuition for continuous distributions

Tags: estimation, intuition, maximum likelihood, probability, statistics

I was just revisiting the fundamentals and rationale of maximum likelihood estimation when I realised I can't rationalise the continuous case the way I can the discrete case.

For a discrete random variable, say $X$, with PMF $p_X(x\mid\theta)$, suppose we have an i.i.d. sample $(X_1, X_2, \ldots, X_n)$ from this distribution. Then the joint PMF is given by:
$$p_{X_1,\ldots,X_n}(x_1,\ldots,x_n\mid\theta) = \prod_{i=1}^n p_X(x_i\mid\theta).$$
We can use the joint PMF to calculate probabilities of specific vectors, such as the probability that $(X_1, \ldots, X_n)$ actually took on the observed values $(x_1, \ldots, x_n)$. However, this probability depends on the fixed value of $\theta$. The MLE methodology suggests that we look for the value of $\theta$ that maximises the probability that $(X_1, \ldots, X_n) = (x_1, \ldots, x_n)$. This is why we have
$$ L(\theta \mid x_1,\ldots,x_n) = \prod_{i=1}^n p_X(x_i\mid\theta),$$ which outputs the probability of the observed values occurring, depending on the value of $\theta$ we select. This function makes sense to maximise as it directly corresponds to probability values (though $L$ is not a PMF/PDF itself).
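
For concreteness, here's a quick sketch of this discrete case (a toy Bernoulli example I made up), where $L(\theta)$ really is the probability of the observed sequence:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # observed i.i.d. Bernoulli(theta) sample

def likelihood(theta, x):
    """Joint PMF of the sample: the probability of this exact sequence."""
    return np.prod(theta**x * (1 - theta)**(1 - x))

# Grid search for the theta that maximises the probability of the data
thetas = np.linspace(0.01, 0.99, 99)
L = np.array([likelihood(t, x) for t in thetas])
print(thetas[L.argmax()], x.mean())  # grid MLE matches the analytic MLE (sample mean)
```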

My confusion arises when we have $X$ continuous, with PDF $f_X(x\mid\theta)$. We have the joint PDF
$$ f_{X_1,\ldots,X_n}(x_1,\ldots,x_n\mid\theta)=\prod_{i=1}^n f_X(x_i\mid\theta). $$
Unlike in the discrete case, this function evaluated at the observed data $(x_1,\ldots,x_n)$ does NOT correspond to the probability of the sample occurring under the distribution $f$: the probability that the vector $(X_1, \ldots, X_n)$ equals any particular sample is $0$, since the distribution is continuous. So why is the next step to maximise the joint density evaluated at the data instead? Is this because the probability mass (the volume under the joint PDF) in a small neighbourhood of the observed point $(x_1,\ldots,x_n)$ would increase?
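
Here is a quick numerical check of that intuition I tried (my own sketch, with made-up data and a $N(\mu, 1)$ model). For small $\varepsilon$, $P(x_i - \varepsilon < X_i < x_i + \varepsilon \text{ for all } i) \approx \prod_i f_X(x_i\mid\theta)\,(2\varepsilon)^n$, and since the $(2\varepsilon)^n$ factor doesn't depend on $\theta$, maximising the joint density should maximise this approximate probability:

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.3, -0.7, 1.2, 0.5])   # hypothetical observed sample
eps = 1e-3                            # half-width of the small box
mus = np.linspace(-2, 2, 401)         # grid of candidate values of theta = mu

# Joint density of the sample at each candidate mu
density = np.array([norm.pdf(x, loc=m).prod() for m in mus])

# Exact probability that every X_i lands within eps of its observed x_i
box_prob = np.array([(norm.cdf(x + eps, loc=m)
                      - norm.cdf(x - eps, loc=m)).prod() for m in mus])

# Both criteria peak at (essentially) the same mu: the sample mean
print(mus[density.argmax()], mus[box_prob.argmax()], x.mean())
```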

I'm trying to conceptualise the trivial case where $n=1$ and MLE suggests I simply select a $\theta$ to maximise $f_{X_1}$ at the observed point. But I can't figure out why that would maximise the probability of observing $x_1$ itself.
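
My best attempt at making this precise (which I'd like confirmed): for a small $\varepsilon > 0$,
$$ P(x_1 - \varepsilon \le X_1 \le x_1 + \varepsilon \mid \theta) \approx 2\varepsilon \, f_{X_1}(x_1 \mid \theta), $$
so for any fixed small $\varepsilon$, the $\theta$ that maximises the density at $x_1$ also approximately maximises the probability of observing something within $\varepsilon$ of $x_1$. But I'm not sure this is the standard justification.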

This question might be a little trivial, but I'm struggling to find any answers on this specific concept. Any help is appreciated! Thank you 🙂

Best Answer

Some discussion here and here.

Long story short, the likelihood function should be thought of as some measure of goodness of fit of a model to some data, and this quantity should be comparable for different models for the same data. See this Wikipedia section for a description of how a likelihood function is generally defined. In the special case of discrete distributions, this general notion happens to coincide with "the probability of seeing my data under this model," but more generally it is not a probability, as you note. It is probably best not to think of likelihoods as probabilities.
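
As a rough illustration of that comparison (a made-up example, not from the linked discussions), you can score two fixed candidate models on the same data by their log-likelihoods. Neither number is a probability, but comparing them ranks the models:

```python
import numpy as np
from scipy.stats import norm, laplace

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # data actually drawn from N(0, 1)

# Log-likelihood of the same data under two fixed candidate models
ll_normal = norm.logpdf(data, loc=0, scale=1).sum()
ll_laplace = laplace.logpdf(data, loc=0, scale=1).sum()

# Neither number is a probability of anything, but their difference
# ranks the models: the normal model should fit this data better.
print(f"N(0,1): {ll_normal:.1f}   Laplace(0,1): {ll_laplace:.1f}")
```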
