Confusion about MLE for a continuous distribution

density function, maximum likelihood, normal distribution, probability

Suppose I have a Bernoulli distribution. It is discrete, so the meaning of the MLE derivation as a joint pmf is clear. For a sample $X_1, X_2,\cdots,X_m$,

$$
L(p) = P(X_1=x_1;X_2=x_2;\cdots;X_m=x_m) = \prod_{i=1}^mP(X_i=x_i)
= \prod_{i=1}^mp^{x_i}(1-p)^{1-x_i} \tag{1}
$$

And then we derive the maximizer $\hat{p} = \dfrac{\sum\limits_{i=1}^mx_i}{m}$. So far so good.
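
(For reference, filling in the step between eq. $(1)$ and $\hat{p}$: take the log-likelihood, differentiate, and set the derivative to zero.)

$$
\log L(p) = \sum_{i=1}^m \Big[x_i\log p + (1-x_i)\log(1-p)\Big], \qquad
\frac{d\log L}{dp} = \frac{\sum_{i=1}^m x_i}{p} - \frac{m-\sum_{i=1}^m x_i}{1-p} = 0
\;\Longrightarrow\; \hat{p} = \frac{\sum_{i=1}^m x_i}{m}
$$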

We actually started by writing down the joint probability mass function of all our sample observations. Since the observations are independent, we simply multiplied the individual pmfs.

I am unable to carry this same notion over to a pdf. For example, the normal distribution: let $X_1, X_2, \cdots, X_m$ be a random sample from a normal distribution $N(\theta_1, \theta_2)$, where $\theta_1$ is the mean and $\theta_2$ the variance.

Then,

$$
L(\theta_1,\theta_2) = P(X_1=x_1;X_2=x_2;\cdots;X_m=x_m) = \prod_{i=1}^{m} \dfrac{1}{\sqrt{2\pi\theta_2}}{\text{exp}}{\Big[ -\dfrac{ (x_i-\theta_1)^2 }{2\theta_2} \Big]} \tag{2}
$$

Question 1:
It is here that I am stuck. The individual probabilities $P(X_i=x_i)$ are $0$ for a continuous pdf without a continuity correction. So how do we justify the above step? What is the notion I am missing here?

My take:
Here is my take so far, but I doubt whether it is correct. Unlike a pmf, which directly gives $P(X_i=x_i)$, a pdf is only a function and always needs to be integrated to give a probability as an area. That is,

If $x_1$ is a sample observation from $N(\theta_1, \theta_2)$, then of course $P(X_1=x_1)=0$, and we are not interested in that in particular (a wrong notion implicitly carried over from the joint-pmf attempt). Instead we are interested in a collective probability density function of all samples' individual probability densities.

That is, below is the continuous pdf for the sample $X_1$:

$$
A = f(x_1; \theta_1, \theta_2) = \dfrac{1}{\sqrt{2\pi\theta_2}}{\text{exp}}{\Big[ -\dfrac{ (x_1-\theta_1)^2 }{2\theta_2}}\Big] \tag{3}
$$

But when we want to find a probability with the above pdf, it is always over a range. For example,

$$
P(X_1 \leq a) = \int_{-\infty}^{a} f(x_1; \theta_1, \theta_2)dx_1 \tag{4}
$$

Similarly, for another sample $X_2$ from the same pdf,

$$
B = f(x_2; \theta_1, \theta_2) = \dfrac{1}{\sqrt{2\pi\theta_2}}{\text{exp}}{\Big[ -\dfrac{ (x_2-\theta_1)^2 }{2\theta_2}}\Big] \tag{5}
$$

And for that, over some range of interest, the probability could be something like the following.

$$
P(X_2 \leq b) = \int_{-\infty}^{b} f(x_2; \theta_1, \theta_2)dx_2 \tag{6}
$$

Note that $A$ and $B$ are functions, while eqs. $(4)$ and $(6)$ denote probabilities calculated from those functions. When we say we are interested in the joint pdf, we are interested in the product of the functions $A$ and $B$ (because the samples are independent), and not in probabilities like $(4)$ and $(6)$. The probability of any joint event of interest can then be calculated from the resulting function $AB$. That is,

$$
AB = f(x_1,x_2;\theta_1,\theta_2) = \prod\limits_{i=1}^{2} \dfrac{1}{\sqrt{2\pi\theta_2}}{\text{exp}}{\Big[ -\dfrac{ (x_i-\theta_1)^2 }{2\theta_2}}\Big] \tag{7}
$$

And then, from this joint pdf, I could calculate the probabilities of interest, for example,

$$
P(X_1 \leq a; X_2 \leq b) = \int_{-\infty}^{a}\int_{-\infty}^{b} \prod\limits_{i=1}^{2} \dfrac{1}{\sqrt{2\pi\theta_2}}\,{\text{exp}}{\Big[ -\dfrac{ (x_i-\theta_1)^2 }{2\theta_2} \Big]}\,dx_2\,dx_1 \tag{8}
$$

Generalizing,

$$
P(X_1 \leq x_1; X_2 \leq x_2) = \prod\limits_{i=1}^{2} \int_{-\infty}^{x_i} \dfrac{1}{\sqrt{2\pi\theta_2}}\,{\text{exp}}{\Big[ -\dfrac{ (t-\theta_1)^2 }{2\theta_2} \Big]}\,dt \tag{9}
$$

Not just the left-tail area: any probability of interest can be calculated after this step. For example,

$$
P(X_1 \geq x_1; X_2 \geq x_2) = \prod\limits_{i=1}^{2} \int_{x_i}^{\infty} \dfrac{1}{\sqrt{2\pi\theta_2}}\,{\text{exp}}{\Big[ -\dfrac{ (t-\theta_1)^2 }{2\theta_2} \Big]}\,dt \tag{10}
$$

This is why, unlike in the pmf case, for a pdf,

$$
\begin{aligned}
f(x_1,x_2;\theta_1,\theta_2) = f(x_1;\theta_1,\theta_2)\,f(x_2;\theta_1,\theta_2)
&\neq P(X_1 \leq x_1; X_2 \leq x_2) \\
&\neq P(X_1 \geq x_1; X_2 \geq x_2) \\
&\neq P(X_1 = x_1; X_2 = x_2)
\end{aligned}
$$
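
As a quick sanity check of these (in)equalities, here is a minimal numerical sketch (assuming SciPy is available; the parameter and data values below are made up): it evaluates the joint density at a pair of observations and compares it with the probabilities obtained by integration.

```python
# Minimal numeric check of the (in)equalities above; illustrative values only.
from scipy.stats import norm

theta1, theta2 = 0.0, 4.0        # hypothetical mean and variance
sigma = theta2 ** 0.5            # scipy's norm takes the standard deviation
x1, x2 = 1.3, -0.7               # hypothetical observed values

# Joint density at (x1, x2): product of individual densities, by independence (eq. 7).
joint_density = norm.pdf(x1, theta1, sigma) * norm.pdf(x2, theta1, sigma)

# Probabilities come from integrating the density (eqs. 9 and 10).
p_left  = norm.cdf(x1, theta1, sigma) * norm.cdf(x2, theta1, sigma)   # P(X1 <= x1, X2 <= x2)
p_right = norm.sf(x1, theta1, sigma) * norm.sf(x2, theta1, sigma)     # P(X1 >= x1, X2 >= x2)

print(joint_density, p_left, p_right)  # three different numbers; P(X1 = x1, X2 = x2) is exactly 0
```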

Question 2:
Can you please confirm whether this understanding is correct (and if not, why), and what am I still missing?

Note:

  1. I am aware of another question that discusses a similar issue, but I could not find a convincing answer there, so I am asking again here, in the way I understand the problem, to add my own take. I also could not find a convincing answer anywhere else. 🙁

Best Answer

It is here that I am stuck. The individual probabilities $P(X_i=x_i)$ are $0$ for a continuous pdf without a continuity correction. So how do we justify the above step? What is the notion I am missing here?

You need to use probability densities instead of probabilities. From the definition of the likelihood:

The likelihood function is always defined as a function of the parameter $\theta$, equal to (or sometimes proportional to) the density of the observed data with respect to a common or reference measure, for both discrete and continuous probability distributions. So, your step is completely justified.

It is true that the individual probabilities are zero, and that to get probabilities from a pdf you need to integrate over some interval. Have a look at this answer on CV.
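
To make this concrete, here is a minimal numerical sketch (made-up data; assumes NumPy and SciPy are available): it maximizes the density of the observed data over the parameters, which is exactly the likelihood written out next.

```python
# Numerical MLE sketch for a normal sample: maximize the (log of the) product of
# densities evaluated at the observed data. Data and starting values are made up.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=500)   # "observed" sample: true mean 2, variance 9

def neg_log_likelihood(params):
    theta1, theta2 = params                    # theta1 = mean, theta2 = variance
    if theta2 <= 0:
        return np.inf
    # sum of log densities == log of the product of densities (the likelihood)
    return -np.sum(norm.logpdf(x, loc=theta1, scale=np.sqrt(theta2)))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
print(result.x)   # close to [x.mean(), x.var()], the analytic MLEs derived below
```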

$L(\theta_1,\theta_2) = p(X_1=x_1;X_2=x_2;\cdots;X_m=x_m|\theta_1,\theta_2) = \prod_{i=1}^{m} \dfrac{1}{\sqrt{2\pi\theta_2}}{\text{exp}}{\Big[ -\dfrac{ (x_i-\theta_1)^2 }{2\theta_2} \Big]} \tag{1}$

where $\theta_1,\theta_2$ are mean and variance respectively.

And $p(X_1=x_1;X_2=x_2;\cdots;X_m=x_m)$ is the joint density.

And, the likelihood will be:

$L(\theta_1,\theta_2) = p(X_1=x_1;X_2=x_2;\cdots;X_m=x_m|\theta_1,\theta_2) = \dfrac{1}{\sqrt{(2\pi\theta_2)^m}}\,{\text{exp}}{\Big[ -\dfrac{1}{2\theta_2}\sum_{i=1}^{m}(x_i-\theta_1)^2 \Big]} \tag{2}$

To get the estimates of $\theta_1,\theta_2$, maximize the likelihood (or, equivalently, the log-likelihood) by taking derivatives with respect to each parameter and setting them to zero.
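
Carrying that out with the log-likelihood gives the familiar closed-form estimators:

$$
\log L(\theta_1,\theta_2) = -\frac{m}{2}\log(2\pi\theta_2) - \frac{1}{2\theta_2}\sum_{i=1}^{m}(x_i-\theta_1)^2,
$$

$$
\frac{\partial \log L}{\partial \theta_1} = 0 \;\Rightarrow\; \hat{\theta}_1 = \frac{1}{m}\sum_{i=1}^{m}x_i,
\qquad
\frac{\partial \log L}{\partial \theta_2} = 0 \;\Rightarrow\; \hat{\theta}_2 = \frac{1}{m}\sum_{i=1}^{m}\big(x_i-\hat{\theta}_1\big)^2.
$$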

Instead we are interested in a collective probability density function of all samples' individual probability densities.

I doubt this statement. You are interested in the pdf of the sample from a normal population. But you are missing that the sample points are realizations of a distribution: they are not random variables, but rather realizations of random variables.

From wiki:

A sample concretely represents the results of n experiments in which the same quantity is measured. For example, if we want to estimate the average height of members of a particular population, we measure the heights of n individuals. Each measurement is drawn from the probability distribution F characterizing the population, so each measured height $x_i$ is the realization of a random variable $X_i$ with distribution F. Note that a set of random variables (i.e., a set of measurable functions) must not be confused with the realizations of these variables (which are the values that these random variables take). In other words, $X_i$ is a function representing the measurement at the i-th experiment and $x_i = X_i(\omega)$ is the value obtained when making the measurement.
