Bayesian Methods – Relationship Between MAP, EM, and MLE Explained

bayesian, expectation-maximization, machine learning, maximum likelihood

I am a beginner in machine learning. I can program fine, but the theory confuses me a lot of the time.

What is the relation between Maximum Likelihood Estimation (MLE), the Maximum A Posteriori (MAP) estimate, and the Expectation-Maximization (EM) algorithm?

I have seen them used as the methods that actually perform the optimization.

Best Answer

Imagine that you have some data $X$ and a probabilistic model parametrized by $\theta$, and you are interested in learning about $\theta$ given your data. The relation between the data, the parameter, and the model is described by the likelihood function

$$ \mathcal{L}(\theta \mid X) = p(X \mid \theta) $$
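To make this concrete, here is a minimal sketch (not from the original answer) of a likelihood for one simple model: Gaussian data with unknown mean $\mu$ playing the role of $\theta$, and the standard deviation assumed known. The simulated data, the choice `sigma=1.0`, and the function name are all assumptions made for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)  # simulated "observed" data

def log_likelihood(mu, X, sigma=1.0):
    # log L(mu | X) = log p(X | mu): sum of per-point Gaussian log-densities
    return norm.logpdf(X, loc=mu, scale=sigma).sum()

print(log_likelihood(2.0, X))  # near the true mean: relatively high
print(log_likelihood(5.0, X))  # far from the true mean: much lower
```

In practice one works with the log-likelihood, since sums of log-densities are numerically far better behaved than products of densities.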

To find the best-fitting $\theta$, you look for the value that best explains your data. Here things start to get complicated, because you can take different views of what $\theta$ is: you may consider it either a fixed parameter or a random variable. If you consider it fixed, then to find its value you look for the value of $\theta$ that maximizes the likelihood function (the maximum likelihood [ML] method). If, on the other hand, you consider it a random variable, then it also has a distribution, so you need to make one more assumption, about the prior distribution of $\theta$, i.e. $p(\theta)$, and you will be using Bayes' theorem for estimation:
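Under the same assumed Gaussian setup as the sketch above, maximum likelihood amounts to a one-dimensional optimization; this example also checks the result against the known closed-form answer (the sample mean), which is a property of this particular model:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)

def neg_log_likelihood(mu):
    # ML maximizes the log-likelihood, so we minimize its negative
    return -norm.logpdf(X, loc=mu, scale=1.0).sum()

mle = minimize_scalar(neg_log_likelihood).x
print(mle, X.mean())  # for a Gaussian mean, the MLE equals the sample mean
```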

$$ p(\theta \mid X) \propto p(X \mid \theta) \, p(\theta) $$
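As a rough illustration of that proportionality (again a sketch, with an assumed $N(0, 3)$ prior on the mean), the posterior can be approximated on a grid by multiplying likelihood and prior pointwise and normalizing at the end; the omitted constant $p(X)$ never needs to be computed:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)

mu_grid = np.linspace(-5.0, 5.0, 1001)
log_prior = norm.logpdf(mu_grid, loc=0.0, scale=3.0)        # log p(theta)
log_lik = np.array([norm.logpdf(X, loc=m, scale=1.0).sum()  # log p(X | theta)
                    for m in mu_grid])
log_post = log_lik + log_prior                    # log posterior, up to a constant
post = np.exp(log_post - log_post.max())          # exponentiate stably
post /= post.sum() * (mu_grid[1] - mu_grid[0])    # normalize on the grid
```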

If you are not interested in estimating the full posterior distribution of $\theta$, but only in the point estimate that maximizes the posterior probability, then you will be using the maximum a posteriori (MAP) method to estimate it.
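Since the MAP estimate is just the mode of the posterior, and $p(X)$ does not depend on $\theta$, it can be found by maximizing $\log p(X \mid \theta) + \log p(\theta)$ directly. A sketch under the same assumed model and prior:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)

def neg_log_posterior(mu):
    # -[log p(X | mu) + log p(mu)]; the normalizing constant p(X) is irrelevant
    return -(norm.logpdf(X, loc=mu, scale=1.0).sum()
             + norm.logpdf(mu, loc=0.0, scale=3.0))

map_estimate = minimize_scalar(neg_log_posterior).x
print(map_estimate)  # pulled slightly away from the MLE, toward the prior mean 0
```

With a flat (uniform) prior, the prior term is constant and the MAP estimate coincides with the MLE, which is one way to see the relation between the two.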

As for expectation-maximization (EM), it is an algorithm that can be used within the maximum likelihood approach to estimate certain kinds of models (e.g. models involving latent variables, or missing-data scenarios).
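To give a flavor of EM (a minimal sketch, assuming a two-component 1-D Gaussian mixture with known unit variances, where the latent variable is each point's component membership): the E-step computes each point's posterior responsibility under the current parameters, and the M-step re-maximizes the likelihood using those responsibilities as weights:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])

mu = np.array([-1.0, 1.0])  # initial guesses for the two component means
pi = 0.5                    # initial mixing weight of component 0

for _ in range(50):
    # E-step: responsibility of component 0 for each point, given current params
    p0 = pi * norm.pdf(X, mu[0], 1.0)
    p1 = (1 - pi) * norm.pdf(X, mu[1], 1.0)
    r0 = p0 / (p0 + p1)
    # M-step: weighted maximum likelihood updates of the parameters
    mu[0] = np.sum(r0 * X) / np.sum(r0)
    mu[1] = np.sum((1 - r0) * X) / np.sum(1 - r0)
    pi = r0.mean()

print(mu, pi)  # should approach the true means (-2, 3) and weight 0.6
```

Each EM iteration is guaranteed not to decrease the likelihood, which is why it is a natural tool when the likelihood itself cannot be maximized directly in closed form.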

Check the following threads to learn more:
Maximum Likelihood Estimation (MLE) in layman terms
What is the difference between Maximum Likelihood Estimation & Gradient Descent?
Bayesian and frequentist reasoning in plain English
Who Are The Bayesians?
Is there a difference between the "maximum probability" and the "mode" of a parameter?
What is the difference between "likelihood" and "probability"?
Wikipedia entry on likelihood seems ambiguous
Numerical example to understand Expectation-Maximization