From what little I know, the EM algorithm can be used to find the maximum likelihood estimate when setting the partial derivatives of the log-likelihood with respect to the parameters to zero yields a system of equations that cannot be solved analytically. But why is the EM algorithm needed, instead of simply applying some numerical technique to find a maximum of the likelihood by solving that system of equations?
Solved – Why is the expectation maximization algorithm used
expectation-maximization
Related Solutions
[Note: This is my answer to the Dec. 19, 2014, version of the question.]
If you operate the change of variable $y=x^2$ in your density $$f_X(x|\alpha,\beta,\sigma)=\frac{1}{\Gamma \left( \alpha \right)\beta^{\alpha}}\exp\left\{{-\frac{x^2}{2\sigma^{2}}\frac{1}{\beta}}\right\}\frac{x^{2\alpha-1}}{2^{\alpha-1}\sigma^{2\alpha}}\mathbb{I}_{{\mathbb{R}}^{+}}(x) $$ the Jacobian is given by $\dfrac{\text{d}y}{\text{d}x}= 2x = 2y^{1/2}$ and hence \begin{align*} f_Y(y|\alpha,\beta,\sigma)&=\frac{1}{\Gamma \left( \alpha \right)\beta^{\alpha}}\exp\left\{{-\frac{y}{2\sigma^{2}}\frac{1}{\beta}}\right\}\frac{y^{\frac{2\alpha-1}{2}}}{2^{\alpha-1}\sigma^{2\alpha}}\frac{1}{2 y^{1/2}}\mathbb{I}_{{\mathbb{R}}^{+}}(y)\\ &=\frac{1}{\Gamma \left( \alpha \right)\beta^{\alpha}}\exp\left\{{-\frac{y}{2\sigma^{2}}\frac{1}{\beta}}\right\}\frac{y^{{\alpha-1}}}{2^{\alpha}\sigma^{2\alpha}}\mathbb{I}_{{\mathbb{R}}^{+}}(y) \end{align*} This shows that
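As a quick numerical sanity check of the identity above, one can compare the transformed density against the standard $\mathcal{G}(\alpha,2\sigma^2\beta)$ density at a few points (the parameter values below are arbitrary choices for illustration):

```python
import math

def f_y(y, alpha, beta, sigma):
    """Transformed density of Y = X^2, as derived above."""
    return (math.exp(-y / (2 * sigma**2 * beta)) * y**(alpha - 1)
            / (math.gamma(alpha) * beta**alpha * 2**alpha * sigma**(2 * alpha)))

def gamma_pdf(y, shape, scale):
    """Standard Gamma(shape, scale) density."""
    return y**(shape - 1) * math.exp(-y / scale) / (math.gamma(shape) * scale**shape)

alpha, beta, sigma = 2.5, 1.3, 0.7
for y in (0.1, 0.5, 1.0, 3.0):
    # the two expressions agree, confirming Y ~ Gamma(alpha, 2*sigma^2*beta)
    assert math.isclose(f_y(y, alpha, beta, sigma),
                        gamma_pdf(y, alpha, 2 * sigma**2 * beta))
```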
- This is a standard $\mathcal{G}(\alpha,2\sigma^2\beta)$ model, i.e. you observe $$(x_1^2,\ldots,x_n^2)=(y_1,\ldots,y_n)\stackrel{\text{iid}}{\sim}\mathcal{G}(\alpha,\eta);$$
- the model is over-parametrised since only $\eta=2\sigma^2\beta$ can be identified;
- EM is not necessary to find the MLE of $(\alpha,\eta)$, which is not available in closed form but is the solution of $$\hat\eta=\bar{y}/\hat{\alpha}\qquad\log(\hat{\alpha})-\psi(\hat{\alpha})=\log(\bar{y})-\frac{1}{n}\sum_{i=1}^n\log(y_i)$$ where $\psi(\cdot)$ is the digamma function. This paper by Thomas Minka describes fast approximations for solving the above equation.
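As a sketch of how one could solve these two equations numerically in plain Python (the `digamma` implementation and the bisection bracket below are my own choices, not taken from Minka's paper):

```python
import math
import random

def digamma(x):
    """Digamma function via upward recurrence plus an asymptotic expansion
    (hand-rolled implementation, accurate to roughly 1e-9 for x > 0)."""
    result = 0.0
    while x < 6.0:              # shift the argument into the asymptotic regime
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def gamma_mle(y, tol=1e-12):
    """MLE for a Gamma(alpha, eta) sample (scale parametrisation):
    solve log(a) - psi(a) = log(ybar) - mean(log y) by bisection
    (the left-hand side is decreasing in a), then set eta = ybar / a."""
    n = len(y)
    ybar = sum(y) / n
    s = math.log(ybar) - sum(math.log(v) for v in y) / n
    lo, hi = 1e-8, 1e8
    while hi - lo > tol * (1.0 + lo):
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) > s:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    return a, ybar / a

random.seed(1)
data = [random.gammavariate(3.0, 2.0) for _ in range(20000)]
a_hat, eta_hat = gamma_mle(data)    # a_hat ≈ 3, eta_hat ≈ 2
```

Minka's fixed-point updates converge in a handful of iterations; bisection is used here only because it is short and safe.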
Extracted from our book, Monte Carlo Statistical Methods (except for the sentence in italics):
A classic (perhaps overused) example of the EM algorithm is the genetics problem (see Rao (1973), Dempster, Laird and Rubin (1977)), where observations $(x_1,x_2,x_3,x_4)$ are gathered from the multinomial distribution $$ \mathfrak{M}\left( n;{1\over 2} + {\theta \over 4}, {1\over 4}(1-\theta), {1\over 4}(1-\theta), {\theta \over 4}\right). $$ Estimation is easier if the $x_1$ cell is split into two cells, so we create the augmented model $$ (z_1,z_2,x_2,x_3,x_4) \sim \mathfrak{M}\left( n;{1\over 2}, {\theta \over 4}, {1\over 4}(1-\theta), {1\over 4}(1-\theta), {\theta \over 4}\right), $$ with $x_1=z_1+z_2$. [This means that, conditional on $X_1=x_1$, $(Z_1,Z_2)$ is multinomial with $x_1$ trials and probabilities proportional to $\left({1\over 2}, {\theta \over 4}\right)$, therefore that$$\mathbb{E}[Z_2|X_1=x_1]=\dfrac{\theta/4}{1/2+\theta/4}x_1=\dfrac{\theta}{2+\theta}x_1\,.\Big]$$ The complete-data likelihood function is then simply $\theta^{z_2+x_4} (1-\theta)^{x_2+x_3}$, as opposed to the observed-data likelihood function $(2+\theta)^{x_1}\theta^{x_4} (1-\theta)^{x_2+x_3}$. The expected complete log-likelihood function is \begin{eqnarray*} &&\mathbb{E}_{\theta_0} [ (Z_2+x_4) \log\theta + (x_2+x_3) \log (1-\theta) ] \\ &&\qquad = \left( {\theta_0 \over 2+\theta_0}x_1 + x_4 \right) \log\theta + (x_2+x_3) \log (1-\theta) \;, \end{eqnarray*} which can easily be maximized in $\theta$, leading to $$ \hat\theta_1 = \frac{\displaystyle{{\theta_0\,x_1\over 2+\theta_0}} + x_4}{\displaystyle{{\theta_0\,x_1\over 2+\theta_0}} +x_2+x_3+x_4} \;. $$
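The update above is easy to code. A minimal sketch, using the familiar counts $(x_1,x_2,x_3,x_4)=(125,18,20,34)$ from Rao (1973) and Dempster et al. (1977) for illustration:

```python
def em_linkage(x, theta=0.5, n_iter=200):
    """EM for the multinomial linkage model above; x = (x1, x2, x3, x4)."""
    x1, x2, x3, x4 = x
    for _ in range(n_iter):
        z2 = theta * x1 / (2 + theta)             # E step: E[Z2 | x1, theta_0]
        theta = (z2 + x4) / (z2 + x2 + x3 + x4)   # M step: the update derived above
    return theta

theta_hat = em_linkage((125, 18, 20, 34))   # ≈ 0.6268, the observed-data MLE
```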
Best Answer
The question is legitimate, and I had the same confusion when I first learned the EM algorithm.
In general terms, the EM algorithm defines an iterative procedure for maximizing the likelihood function of a parametric model in the case where some variables of the model are (or are treated as) "latent" or unobserved.
In theory, for the same purpose, you could use a generic numerical optimizer to maximize the likelihood over all parameters simultaneously. In practice, however, this direct optimization tends to be slow and prone to getting stuck in poor local optima.
A very common application of the EM method is fitting a mixture model. In this case, treating the variables that assign each sample to one of the components as "latent" variables greatly simplifies the problem.
Let's look at an example. We have $N$ samples $x = \{x_i\}$ drawn from a mixture of two normal distributions. To find the parameters without EM, we would have to minimize the negative log-likelihood
$$-\log \mathcal{L}(x,\theta) = -\sum_{i=1}^N \log\Big[ \frac{a_1}{\sqrt{2\pi}\sigma_1} \exp\Big( -\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\Big) + \frac{a_2}{\sqrt{2\pi}\sigma_2} \exp\Big(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\Big) \Big]$$
where $\theta = (a_1, a_2, \mu_1, \mu_2, \sigma_1, \sigma_2)$ collects all the parameters.
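For concreteness, here is this objective as the function a generic optimizer would have to minimize (the ordering of the entries in `params` is an arbitrary choice of mine):

```python
import math

def mixture_nll(params, x):
    """Negative log-likelihood of a two-component Gaussian mixture.
    params = (a1, mu1, sigma1, mu2, sigma2), with a2 = 1 - a1."""
    a1, mu1, s1, mu2, s2 = params
    nll = 0.0
    for xi in x:
        p1 = a1 / (math.sqrt(2 * math.pi) * s1) * math.exp(-(xi - mu1)**2 / (2 * s1**2))
        p2 = (1 - a1) / (math.sqrt(2 * math.pi) * s2) * math.exp(-(xi - mu2)**2 / (2 * s2**2))
        nll -= math.log(p1 + p2)   # log of a sum: the terms do not separate
    return nll
```

Note the log of a sum inside the objective: it couples all the parameters together, which is exactly what makes direct optimization awkward and what the EM decomposition avoids.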
By contrast, using the EM algorithm, we first compute, for each sample, the probability that it belongs to each component (E step) and then refit each component separately, using these probabilities as weights (M step). In this example the M-step reduces to weighted means and variances for $\mu_k$ and $\sigma_k$. Iterating these two steps is a simpler and more robust way to minimize $-\log \mathcal{L}(x,\theta)$.
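A minimal sketch of this E/M iteration for two Gaussian components with soft assignments (responsibilities); the initial values, iteration count, and simulated data below are arbitrary choices for illustration:

```python
import math
import random

def em_mixture(x, mu, sigma=(1.0, 1.0), a=(0.5, 0.5), n_iter=100):
    """EM for a two-component Gaussian mixture.
    mu, sigma, a are the initial means, standard deviations, and weights."""
    mu, sigma, a = list(mu), list(sigma), list(a)
    for _ in range(n_iter):
        # E step: responsibility of component 0 for each sample
        # (the 1/sqrt(2*pi) factor cancels in the ratio, so it is omitted)
        r = []
        for xi in x:
            p = [a[k] / sigma[k] * math.exp(-(xi - mu[k])**2 / (2 * sigma[k]**2))
                 for k in range(2)]
            r.append(p[0] / (p[0] + p[1]))
        # M step: weighted means, standard deviations, and weights
        for k, rk in ((0, r), (1, [1 - ri for ri in r])):
            w = sum(rk)
            mu[k] = sum(ri * xi for ri, xi in zip(rk, x)) / w
            sigma[k] = math.sqrt(sum(ri * (xi - mu[k])**2
                                     for ri, xi in zip(rk, x)) / w)
            a[k] = w / len(x)
    return mu, sigma, a

random.seed(0)
x = ([random.gauss(0.0, 1.0) for _ in range(500)]
     + [random.gauss(5.0, 1.0) for _ in range(500)])
mu, sigma, a = em_mixture(x, mu=(1.0, 4.0))   # mu ≈ [0, 5], a ≈ [0.5, 0.5]
```

Each M-step here is a closed-form weighted fit, which is why the iteration is so much simpler than handing the full negative log-likelihood to a generic optimizer.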