Solved – Why Expectation Maximization is important for mixture models

expectation-maximization, gaussian mixture distribution, machine learning, optimization

Much of the literature emphasizes the Expectation Maximization (EM) method for mixture models (mixtures of Gaussians, hidden Markov models, etc.).

Why is EM important? EM is just one way to do optimization, and it is not as widely used as gradient-based methods (gradient descent, Newton's / quasi-Newton methods) or the other gradient-free methods discussed HERE. In addition, EM still suffers from the local optima problem.

Is it because the procedure is intuitive and can easily be turned into code? Or are there other reasons?

Best Answer

In principle, both EM and standard optimization approaches can work for fitting mixture distributions. Like EM, general-purpose nonlinear optimization solvers will converge to a local optimum. But a variety of optimization algorithms exist for seeking better solutions in the presence of multiple local optima. As far as I'm aware, the algorithm with the best convergence speed will depend on the problem.

One benefit of EM is that it naturally produces valid parameters for the mixture distribution on every iteration. In contrast, standard optimization algorithms would need constraints to be imposed. For example, say you're fitting a Gaussian mixture model. A standard nonlinear programming approach would require constraining covariance matrices to be positive semidefinite, and constraining mixture component weights to be nonnegative and sum to one.
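To make this concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture (the function and variable names are my own, for illustration only). Notice that the M-step updates keep the weights nonnegative and summing to one, and the variances positive, on every iteration, with no explicit constraints.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """EM for a one-dimensional Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialization: uniform weights, means drawn from the data, common variance.
    w = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, np.var(x))
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | x_i),
        # computed in log space for numerical stability.
        log_pdf = -0.5 * (x[:, None] - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)
        log_r = np.log(w) + log_pdf
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates. The weights are automatically
        # nonnegative and sum to one, and the variances stay positive,
        # so no constraints have to be imposed.
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

# Example on synthetic data with two well-separated components.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 0.5, 500)])
print(em_gmm_1d(x, k=2))
```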

To achieve good performance on high dimensional problems, a nonlinear programming solver typically needs to exploit the gradient. So, you'd have to either derive the gradient or compute it with automatic differentiation. Gradients are also needed for constraint functions if they don't have a standard form. Newton's method and related approaches (e.g. trust region methods) need the Hessian as well. Finite differencing or derivative-free methods could be used if the gradient is unavailable, but performance tends to scale poorly as the number of parameters increases. In contrast, EM doesn't require the gradient.
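For comparison, here is a hedged sketch of the generic-solver route for the same one-dimensional mixture, assuming SciPy is available. Instead of explicit constraints, it uses an unconstrained reparameterization (softmax logits for the weights, log-variances for the variances); with no analytic gradient supplied, BFGS falls back to finite-difference gradients, which is exactly the scaling concern described above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def neg_log_lik(theta, x, k):
    # Unconstrained reparameterization: softmax logits for the weights,
    # raw means, and log-variances so the variances stay positive.
    logits, mu, log_var = np.split(theta, [k, 2 * k])
    w = softmax(logits)
    var = np.exp(log_var)
    log_pdf = -0.5 * (x[:, None] - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)
    return -logsumexp(np.log(w) + log_pdf, axis=1).sum()

# Two-component fit with BFGS. With jac=None the solver approximates the
# gradient by finite differences, which becomes expensive as the number of
# parameters grows.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 0.5, 500)])
theta0 = np.concatenate([np.zeros(2), np.array([-1.0, 1.0]), np.zeros(2)])
res = minimize(neg_log_lik, theta0, args=(x, 2), method="BFGS")
logits, mu, log_var = np.split(res.x, [2, 4])
print(softmax(logits), mu, np.exp(log_var))
```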

EM is conceptually intuitive, which is a great virtue. This often holds for standard optimization approaches as well. There are many implementation details, but the overall concept is simple. It's often possible to use standard optimization solvers that abstract these details away under the hood. In these cases, a user just has to supply the objective function, constraints, and gradients, and have enough working knowledge to select a solver that's well suited for the problem. But, specialized knowledge is certainly required if it gets to the point where the user has to think about or implement low-level details of the optimization algorithm.

Another benefit of the EM algorithm is that it can be used in cases where some data values are missing.
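As an illustration of the missing-data point, the sketch below (my own example, assuming missing entries are marked with NaN) uses EM to fit a multivariate Gaussian when some coordinates of some observations are unobserved: the E-step fills each missing block with its conditional expectation and adds the conditional covariance to the sufficient statistics, and the M-step is the usual Gaussian update.

```python
import numpy as np

def em_gaussian_missing(X, n_iter=50):
    """Fit a multivariate Gaussian to data with NaN-marked missing entries by EM."""
    n, d = X.shape
    # Initialize from the observed entries of each column.
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        M = np.zeros(d)            # accumulates E[x]
        S = np.zeros((d, d))       # accumulates E[x x^T]
        for x in X:
            miss = np.isnan(x)
            obs = ~miss
            x_hat = x.copy()
            C = np.zeros((d, d))
            if miss.any():
                # E-step: conditional mean / covariance of the missing block
                # given the observed block under the current parameters.
                gain = sigma[np.ix_(miss, obs)] @ np.linalg.inv(sigma[np.ix_(obs, obs)])
                x_hat[miss] = mu[miss] + gain @ (x[obs] - mu[obs])
                C[np.ix_(miss, miss)] = (sigma[np.ix_(miss, miss)]
                                         - gain @ sigma[np.ix_(obs, miss)])
            M += x_hat
            S += np.outer(x_hat, x_hat) + C
        # M-step: standard Gaussian updates from the completed statistics.
        mu = M / n
        sigma = S / n - np.outer(mu, mu)
    return mu, sigma

# Example usage: synthetic 2-D data with about 30% of the second coordinate missing.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.6], [0.6, 2.0]], size=500)
X[rng.random(len(X)) < 0.3, 1] = np.nan
print(em_gaussian_missing(X))
```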
