Solved – Expectation-Maximization Algorithm for Binomial

binomial-distribution, expectation-maximization, latent-variable, multinomial-distribution

I have a multinomial distribution with four outcomes, with probability mass function:

$$p(x_1,x_2,x_3,x_4)=\frac{n!}{x_1!x_2!x_3!x_4!}p_1^{x_1}p_2^{x_2}p_3^{x_3}p_4^{x_4}, \sum_{i=1}^4x_i=n, \sum_{i=1}^4p_i=1$$

The probabilities are related to a single parameter $0\le\theta\le 1$ by
\begin{align*}
p_1&=\frac{1}{2}+\frac{1}{4}\theta\\
p_2&=\frac{1}{4}-\frac{1}{4}\theta\\
p_3&=\frac{1}{4}-\frac{1}{4}\theta\\
p_4&=\frac{1}{4}\theta
\end{align*}

If we have an observation $x=(x_1,x_2,x_3,x_4)$, the log-likelihood is:

$$l(\theta)=x_1\log(2+\theta)+(x_2+x_3)\log(1-\theta)+x_4\log(\theta)+c$$

Also, I will suppose $x=(125,18,20,34).$
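Since I eventually want to do this in R, here is how I would code the cell probabilities and this log-likelihood, just to make sure I have the model down (`probs` and `loglik` are names I made up):

probs <- function(theta) c(1/2 + theta/4, (1 - theta)/4, (1 - theta)/4, theta/4)   # cell probabilities; should sum to 1
loglik <- function(theta, x) x[1]*log(2 + theta) + (x[2] + x[3])*log(1 - theta) + x[4]*log(theta)   # up to the constant c
x <- c(125, 18, 20, 34)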

1) I start by finding the MLE of $\theta$ by simply maximizing its log-likelihood. I took the derivative of the log-likelihood with respect to $\theta$ and set it equal to zero:
\begin{align*}
\frac{x_1}{2+\theta}-\frac{x_2+x_3}{1-\theta}+\frac{x_4}{\theta}&=0
\\
\frac{125}{2+\theta}-\frac{38}{1-\theta}+\frac{34}{\theta}&=0
\\
197\theta^2-15\theta-68 &=0
\end{align*}
Using the quadratic formula I get: $\theta \in \{0.6268, -0.5507\}$.
$\theta$ cannot be negative here, and the log-likelihood tends to $-\infty$ as $\theta\to 0^+$ or $\theta\to 1^-$, so the interior critical point is a maximum and the MLE is $\hat\theta\approx 0.6268$.

I do not know whether I was able to successfully derive the MLE by
maximizing the log-likelihood. If it is correct, I also do not know whether
it is appropriate to leave it in this "plus or minus" form.
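As a quick sanity check (assuming my log-likelihood above is right), I solved the quadratic and maximized the log-likelihood numerically in R; both should point to the same positive root:

polyroot(c(-68, -15, 197))   # roots of 197*theta^2 - 15*theta - 68 = 0 (coefficients in increasing order)
ll <- function(t) 125*log(2 + t) + 38*log(1 - t) + 34*log(t)   # observed-data log-likelihood, up to c
optimize(ll, c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum        # about 0.6268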

2) Now I would like to compare what I got in (1) with the EM algorithm. To do so, I consider a multinomial with five classes, formed from the original multinomial by splitting the first class into two with probabilities $\frac{1}{2}$ and $\frac{\theta}{4}$. The original variable $x_1$ is split as $x_1=x_{11}+x_{12}$. We would then have an MLE of $\theta$ by treating $x_{12}+x_4$ as a realization of a binomial with $n=x_{12}+x_4+x_2+x_3$ and $p=\theta$. However, we do not observe $x_{12}$, and the complete-data log-likelihood is:

$$l_c(\theta)=(x_{12}+x_4)\log(\theta)+(x_2+x_3)\log(1-\theta)$$

I would like to develop an EM algorithm for estimating $\theta$. I am told that I should be able to combine the E-step and M-step together, i.e., $\hat\theta^{(t+1)}$ can be expressed in terms of $\hat\theta^{(t)}$.

I am very stuck on how to conceptualize this problem. I have read about
the EM algorithm, including the famous Nature article and the
"Numerical example to understand Expectation-Maximization" question.

All I can say right now is that the binomial must follow:

${n \choose k}p^k(1-p)^{(n-k)}$

and $n=x_{12}+72$

But I am very lost as to what to do for the expectation and maximization steps! In fact, I want to implement this in R, and all I can get out so far is:

x <- c(125, 18, 20, 34)   # observed data (x1, x2, x3, x4)
t <- 0.5                  # starting guess for theta
l <- (x12 + x[4])*log(t) + (x[2] + x[3])*log(1 - t)   # complete-data log-likelihood, but x12 is unknown

And I am already stuck, because I do not have x12, and I am only guessing
that I should start t (theta) at a default value of 0.5. Any advice on
how to start the algorithm and how to perform the E and M steps would
really help me!

Best Answer

Extracted from our book, Monte Carlo Statistical Methods (except for the sentence in italics):

A classic (perhaps overused) example of the EM algorithm is the genetics problem (see Rao (1973), Dempster, Laird and Rubin (1977)), where observations $(x_1,x_2,x_3,x_4)$ are gathered from the multinomial distribution
$$ \mathfrak{M}\left( n;{1\over 2} + {\theta \over 4}, {1\over 4}(1-\theta), {1\over 4}(1-\theta), {\theta \over 4}\right). $$

Estimation is easier if the $x_1$ cell is split into two cells, so we create the augmented model
$$ (z_1,z_2,x_2,x_3,x_4) \sim \mathfrak{M}\left( n;{1\over 2}, {\theta \over 4}, {1\over 4}(1-\theta), {1\over 4}(1-\theta), {\theta \over 4}\right), $$
with $x_1=z_1+z_2$. [This means that $(Z_1,Z_2)$ is a Multinomial $\mathfrak{M}\left( x_1; {1\over 2}, {\theta \over 4}\right)$, therefore that
$$\mathbb{E}[Z_2|X_1=x_1]=\dfrac{\theta/4}{1/2+\theta/4}x_1=\dfrac{\theta}{2+\theta}x_1\,.\Big]$$

The complete-data likelihood function is then simply $\theta^{z_2+x_4} (1-\theta)^{x_2+x_3}$, as opposed to the observed-data likelihood function $(2+\theta)^{x_1}\theta^{x_4} (1-\theta)^{x_2+x_3}$. The expected complete log-likelihood function is
\begin{eqnarray*}
&&\mathbb{E}_{\theta_0} [ (Z_2+x_4) \log\theta + (x_2+x_3) \log (1-\theta) ] \\
&&\qquad = \left( {\theta_0 \over 2+\theta_0}x_1 + x_4 \right) \log\theta + (x_2+x_3) \log (1-\theta) \;,
\end{eqnarray*}
which can easily be maximized in $\theta$, leading to
$$ \hat\theta_1 = \frac{\displaystyle{{\theta_0\,x_1\over 2+\theta_0}} + x_4}{\displaystyle{{\theta_0\,x_1\over 2+\theta_0}} +x_2+x_3+x_4} \;. $$
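To connect this with the R implementation asked about above, here is a minimal sketch of the combined E-and-M update (the names `em_update` and `theta` are mine, not from the book, and I start from the $\theta^{(0)}=0.5$ guessed in the question):

x <- c(125, 18, 20, 34)                      # observed data
em_update <- function(theta, x) {
  z2 <- theta * x[1] / (2 + theta)           # E-step: E[Z2 | x1, theta]
  (z2 + x[4]) / (z2 + x[2] + x[3] + x[4])    # M-step: closed-form maximizer
}
theta <- 0.5                                 # starting value theta^(0)
for (i in 1:20) theta <- em_update(theta, x)
theta                                        # about 0.6268, matching the direct MLE from (1)

Each pass of the loop is one full EM iteration, since the E-step reduces to computing $\mathbb{E}_{\theta^{(t)}}[Z_2\mid x_1]$ and the M-step has the closed form given above.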
