Maximum Likelihood – Understanding the Equation for Maximizing Model Probability

intuition, mathematical-statistics, maximum-likelihood

We have seen some definitions of common estimators and analyzed their properties. But where did these estimators come from? Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.
The most common such principle is the maximum likelihood principle.
Consider a set of $m$ examples $\mathbb{X}=\left\{x^{(1)}, \ldots, x^{(m)}\right\}$ drawn independently from the true but unknown data-generating distribution $p_{\text {data }}(\mathbf{x})$.

Let $p_{\text {model }}(\mathbf{x} ; \boldsymbol{\theta})$ be a parametric family of probability distributions over the same space indexed by $\boldsymbol{\theta}$. In other words, $p_{\text {model }}(x ; \boldsymbol{\theta})$ maps any configuration $\boldsymbol{x}$ to a real number estimating the true probability $p_{\text {data }}(x)$.
The maximum likelihood estimator for $\theta$ is then defined as

$$
\begin{aligned}
\theta_{\mathrm{ML}} &=\underset{\theta}{\arg \max } p_{\text {model }}(\mathbb{X} ; \boldsymbol{\theta}) \\
&=\underset{\theta}{\arg \max } \prod_{i=1}^{m} p_{\text {model }}\left(\boldsymbol{x}^{(i)} ; \boldsymbol{\theta}\right)
\end{aligned}
$$

This product over many probabilities can be inconvenient for various reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum:

$$
\theta_{\mathrm{ML}}=\underset{\theta}{\arg \max } \sum_{i=1}^{m} \log p_{\text {model }}\left(\boldsymbol{x}^{(i)} ; \boldsymbol{\theta}\right)
$$
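To make the underflow remark concrete, here is a small R check with made-up numbers: the product of many per-example probabilities collapses to zero in double precision, while the sum of their logarithms is a perfectly ordinary number (and, since the log is monotonic, the arg max is unchanged).

p <- rep(0.01, 1000)   # 1000 hypothetical per-example probabilities
prod(p)                # 0: the true value 1e-2000 underflows double precision
sum(log(p))            # about -4605: the log-likelihood is easy to work with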

I am a slow learner in Mathematics. I don't get the part of the quote about mapping a configuration $\boldsymbol{x}$ to a real number, as it is too abstract for me.

  • I guess "$p_{\text {model }}(x ; \boldsymbol{\theta})$", which "maps any configuration $\boldsymbol{x}$ to a real number estimating the true probability $p_{\text {data }}(x)$", is a function with a parameter (or parameters) $\theta$ that more or less fits the data.

  • I struggle with $\underset{\theta}{\arg \max }$. I guess that defining the maximum likelihood estimator for $\theta$ as $\theta_{\mathrm{ML}} =\underset{\theta}{\arg \max } p_{\text {model }}(\mathbb{X} ; \boldsymbol{\theta})$ means that we are looking for the "$\theta$" parameter (or parameters) that give the most likely function explaining the observed data (I try a small numeric check of this right after this list).

  • I really don't get why $\underset{\theta}{\arg \max } p_{\text {model }}(\mathbb{X} ; \boldsymbol{\theta}) =\underset{\theta}{\arg \max } \prod_{i=1}^{m} p_{\text {model }}\left(\boldsymbol{x}^{(i)} ; \boldsymbol{\theta}\right)$. Why do we need to multiply the probabilities to find the most likely function that explains the observed data?
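To convince myself about the $\arg \max$ notation, here is a small numeric R check with made-up coin-flip counts (7 heads and 3 tails): $\max$ is the largest likelihood value over a grid of candidate parameters, while $\arg \max$ is the parameter value at which that maximum is reached.

theta <- seq(0.01, 0.99, by = 0.01)   # candidate values of theta, the probability of heads
lik   <- theta^7 * (1 - theta)^3      # likelihood of 7 heads and 3 tails at each candidate
max(lik)                              # the maximum likelihood value itself
theta[which.max(lik)]                 # the arg max: 0.7, the theta that achieves the maximum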

I took an example by gregmacfarlane from the answer to the question "Maximum Likelihood Estimation (MLE) in layman terms" to understand:

Maximum Likelihood Estimation (MLE) is a technique to find the most likely
function that explains observed data. I think math is necessary, but don't let it
scare you!

Let's say that we have a set of points in the $x,y$ plane, and we want to know
the function parameters $\beta$ and $\sigma$ that most likely fit the data
(in this case we know the function because I specified it to create this
example, but bear with me).

beta  <- 2    # example slope (any fixed value works for the illustration)
sigma <- 1.5  # example noise standard deviation
data   <- data.frame(x = runif(200, 1, 10))
data$y <- 0 + beta * data$x + rnorm(200, 0, sigma)
plot(data$x, data$y)

[figure: scatterplot of the simulated data points]

In order to do an MLE, we need to make assumptions about the form of the function.
In a linear model, we assume that the points follow a normal (Gaussian) probability
distribution, with mean $x\beta$ and variance $\sigma^2$: $y \sim \mathcal{N}(x\beta, \sigma^2)$. The probability density function of this distribution is: $$\frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left(-\frac{(y_i-x_i\beta)^2}{2\sigma^2}\right)}$$
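As a quick sanity check of that density formula (with my own illustrative values for beta, sigma and one data point, not taken from the quoted answer), writing it out by hand gives the same number as R's built-in dnorm:

beta  <- 2; sigma <- 1.5   # illustrative parameter values
x_i   <- 4; y_i   <- 9     # one made-up data point
1 / sqrt(2 * pi * sigma^2) * exp(-(y_i - x_i * beta)^2 / (2 * sigma^2))
dnorm(y_i, mean = x_i * beta, sd = sigma)   # same value from the built-in normal density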

What we want to find are the parameters $\beta$ and $\sigma$ that maximize this
probability jointly over all the observed points $(x_i, y_i)$. This is the "likelihood" function, $\mathcal{L}$:

$$\mathcal{L} = \prod_{i=1}^n f(y_i \mid x_i; \beta, \sigma) = \prod_{i=1}^n \dfrac{1}{\sqrt{2\pi\sigma^2}}
\exp\Big({-\dfrac{(y_i - x_i\beta)^2}{2\sigma^2}}\Big)$$

For various reasons, it's easier to use the log of the likelihood function:
$$\log(\mathcal{L}) = -\frac{n}{2}\log(2\pi) -\frac{n}{2}\log(\sigma^2) -
\frac{1}{2\sigma^2}\sum_{i = 1}^n(y_i - x_i\beta)^2$$
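To tie this back to the simulated data, here is a minimal sketch (my own, not part of the quoted answer, assuming the data frame and the example beta and sigma from the code above): it evaluates the log-likelihood with dnorm(..., log = TRUE) and maximizes it numerically with base R's optim, so the recovered estimates should land close to the values used to generate the data.

# negative log-likelihood of the linear model as a function of (beta, sigma)
negloglik <- function(par, x, y) {
  beta  <- par[1]
  sigma <- par[2]
  if (sigma <= 0) return(Inf)   # the standard deviation must stay positive
  -sum(dnorm(y, mean = x * beta, sd = sigma, log = TRUE))
}

# maximizing the log-likelihood is the same as minimizing its negative; starting values are arbitrary
fit <- optim(par = c(beta = 1, sigma = 1), fn = negloglik, x = data$x, y = data$y)
fit$par   # should be close to the beta and sigma used to simulate the data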

But I still don't get why we need to compute this multiplication. Did I also get the first two points wrong?

I am a slow learner in Mathematics and Statistics. Don't hesitate to explain it slowly, or with examples like generic_user's answer to the same question.

Best Answer

The multiplication expresses the assumption that the data are independent (as answered in a comment by user @whuber): for independent observations, the probability of observing the whole dataset is the product of the probabilities of the individual observations, which is exactly the product in the definition above.

If the data are not independent, maximum likelihood can still be used, but the calculation can be much more involved than in the independent case, and the equality in the question, i.e. the factorization of $p_{\text {model }}(\mathbb{X} ; \boldsymbol{\theta})$ into a product over the individual examples, no longer holds.
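A tiny R illustration of the independent case (my own toy numbers, assuming a $\mathcal{N}(5, 1)$ model): the joint density of the sample is the product of the individual densities, and taking logs turns that product into a sum.

y <- c(4.8, 5.3, 6.1)                        # three made-up independent draws
prod(dnorm(y, mean = 5, sd = 1))             # joint density = product of individual densities
sum(dnorm(y, mean = 5, sd = 1, log = TRUE))  # log-likelihood = sum of log densities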
