Statistical Methods – Comparing Maximum Likelihood Estimation (MLE) and Bayes’ Theorem

Tags: bayesian, maximum-likelihood

In Bayes' theorem, $$p(y|x) = \frac{p(x|y)p(y)}{p(x)}$$ and from the book I'm reading, $p(x|y)$ is called the likelihood, but I assume it's just the conditional probability of $x$ given $y$, right?

Maximum likelihood estimation tries to maximize $p(x|y)$, right? If so, I'm badly confused, because $x$ and $y$ are both random variables, right? Is maximizing $p(x|y)$ just a matter of finding $\hat y$? One more problem: if these two random variables are independent, then $p(x|y)$ is just $p(x)$, so maximizing $p(x|y)$ amounts to maximizing $p(x)$.

Or maybe $p(x|y)$ is a function of some parameters $\theta$, that is, $p(x|y; \theta)$, and MLE tries to find the $\theta$ that maximizes $p(x|y; \theta)$? Or is $y$ actually the parameter of the model rather than a random variable, so that maximizing the likelihood means finding $\hat y$?

UPDATE

I'm a novice in machine learning, and this confusion comes from a machine learning tutorial I'm reading. Given an observed dataset $\{x_1, x_2, \ldots, x_n\}$ with target values $\{y_1, y_2, \ldots, y_n\}$, I try to fit a model to this dataset, so I assume that, given $x$, $y$ follows a distribution $W$ parameterized by $\theta$, that is, $p(y|x; \theta)$, and I assume this is the posterior probability, right?

Now, to estimate the value of $\theta$, I use MLE. OK, here comes my problem: I think the likelihood is $p(x|y;\theta)$, right? Does maximizing the likelihood mean I should pick the right $\theta$ and $y$?

If my understanding of likelihood is wrong, please show me the right way.

Best Answer

I think the core misunderstanding stems from the questions you ask in the first half of your question. I approach this answer by contrasting the MLE and Bayesian inferential paradigms. A very approachable discussion of MLE can be found in chapter 1 of Gary King's Unifying Political Methodology; Gelman's Bayesian Data Analysis provides the details on the Bayesian side.

In Bayes' theorem, $$p(y|x)=\frac{p(x|y)p(y)}{p(x)}$$ and from the book I'm reading, $p(x|y)$ is called the likelihood, but I assume it's just the conditional probability of $x$ given $y$, right?

The likelihood is a conditional probability. To a Bayesian, this formula describes the distribution of the parameter $y$ given the data $x$ and the prior $p(y)$. But since this notation doesn't reflect your intention, henceforth I will use $\theta$ for the parameters and $(x, y)$ for your data.

But your update indicates that $(x, y)$ are observed jointly from some distribution $p(x,y|\theta)$. If we place our data and parameters in the appropriate places in Bayes' rule, we find that these additional parameters pose no problems for Bayesians: $$p(\theta|x,y)=\frac{p(x,y|\theta)\,p(\theta)}{p(x,y)}$$

I believe this expression is what you are after in your update.
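To make this concrete, here is a minimal sketch in Python (my own illustration, not from your tutorial; the linear-Gaussian model $y_i \sim \mathcal{N}(\theta x_i, 1)$, the standard normal prior, and all variable names are assumptions) of computing $p(\theta|x,y)$ by brute force on a grid. Note that I condition on $x$ and model $y$, the usual regression convention:

```python
import numpy as np

# Toy data for a hypothetical linear-Gaussian model: y_i ~ N(theta * x_i, 1).
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(size=20)  # true slope is 2.0

# Grid of candidate values for the single parameter theta.
theta_grid = np.linspace(-5, 5, 1001)

# log p(y | x, theta): Gaussian log-likelihood, dropping constants.
resid = y[:, None] - theta_grid[None, :] * x[:, None]
log_lik = -0.5 * np.sum(resid**2, axis=0)

# Prior p(theta) = N(0, 1), again up to a constant.
log_prior = -0.5 * theta_grid**2

# Unnormalized log-posterior: log-likelihood plus log-prior.
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum()  # normalizing plays the role of dividing by p(x, y)

print("posterior mean of theta:", np.sum(theta_grid * post))
```

Normalizing over the grid stands in for dividing by the evidence $p(x,y)$, which is why that denominator never has to be computed explicitly.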

The maximum likelihood estimation tries to maximize $p(x,y|\theta)$, right?

Yes. MLE posits that $$p(x,y|\theta) \propto p(\theta|x,y),$$ that is, it treats the term $\frac{p(\theta)}{p(x,y)}$ as an unknown (and unknowable) constant. By contrast, Bayesian inference treats $p(x,y)$ as a normalizing constant (so that probabilities sum or integrate to unity) and $p(\theta)$ as a key piece of information: the prior. We can think of the prior as a way of incurring a penalty on the optimization procedure for "wandering too far away" from the region we think is most plausible.
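Here is a hedged sketch of that contrast (again my own toy example; the use of SciPy is an assumption, not something from the original post). The only difference between the MLE objective and a maximum a posteriori (MAP) point estimate, which is just the posterior mode, is the added log-prior penalty term:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Same hypothetical linear-Gaussian toy model as above.
rng = np.random.default_rng(1)
x = rng.normal(size=10)
y = 2.0 * x + rng.normal(size=10)

def neg_log_likelihood(theta):
    # -log p(y | x, theta), up to an additive constant.
    return 0.5 * np.sum((y - theta * x) ** 2)

def neg_log_posterior(theta):
    # Adding -log p(theta) for a N(0, 1) prior penalizes estimates
    # that wander too far from the region the prior finds plausible.
    return neg_log_likelihood(theta) + 0.5 * theta**2

mle = minimize_scalar(neg_log_likelihood).x
map_estimate = minimize_scalar(neg_log_posterior).x
print(f"MLE: {mle:.3f}  MAP: {map_estimate:.3f}")  # MAP is shrunk toward 0
```

MAP is used here only to isolate the effect of the prior term; it is still a point summary, not the full posterior a Bayesian would report.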

If so, I'm badly confused, because $x$, $y$, and $\theta$ are all random variables, right? Is maximizing $p(x,y|\theta)$ just a matter of finding $\hat{\theta}$?

In MLE, $\theta$ is assumed to be a fixed quantity that is unknown but able to be inferred, not a random variable; the resulting point estimate is written $\hat{\theta}$. Bayesian inference treats $\theta$ as a random variable: it puts probability density functions in and gets probability density functions out, rather than the point summaries of the model produced by MLE. That is, Bayesian inference looks at the full range of parameter values and the probability of each, while MLE posits that $\hat{\theta}$ is an adequate summary of the data given the model.
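Continuing the same assumed toy model, the practical difference shows up in what each approach returns: MLE hands back the single number $\hat{\theta}$ (which has a closed form here, the least-squares slope), while Bayesian inference hands back a density over $\theta$ from which any summary, such as an interval, can be read off:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=15)
y = 2.0 * x + rng.normal(size=15)

# MLE output: a single number (closed-form least-squares slope).
theta_hat = np.sum(x * y) / np.sum(x * x)

# Bayesian output: a density over theta (grid approximation, N(0, 1) prior).
grid = np.linspace(-5, 5, 2001)
log_post = (-0.5 * np.sum((y[:, None] - grid[None, :] * x[:, None]) ** 2, axis=0)
            - 0.5 * grid**2)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Any summary can be read off the density, e.g. a 95% credible interval.
cdf = np.cumsum(post)
lo, hi = grid[np.searchsorted(cdf, [0.025, 0.975])]
print(f"MLE point estimate:     {theta_hat:.3f}")
print(f"95% credible interval:  [{lo:.3f}, {hi:.3f}]")
```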
