Understanding how the EM algorithm actually works for missing data

algorithms, computational mathematics, machine learning, numerical optimization, statistics

I am currently studying the EM algorithm for handling missing data in a data set. I understand that the final goal of the EM algorithm is not to impute data, but to estimate the parameters of interest. However, when I think of a practical example I really struggle to make sense of the E-step: calculate $Q(\theta, \theta^{(k)}) = \mathbb{E}_{Z|X,\theta^{(k)}} [\ln p_{XZ}(x, z \mid \theta)] = \int_{\mathcal{Z}} \ln p_{XZ}(x, z \mid \theta) \, p_{Z|X}(z \mid x, \theta^{(k)}) \, dz$, where $x$ is the realization of $X$ (i.e. the observed variables) and $z$ is the realization of $Z$ (i.e. the variables with missing values).

Now suppose that we have a data set where all the instances come from a random variable $Z$ which has a bivariate normal distribution, and also suppose that several values are missing in this data set. How will the E-step work in this case (where $\ln{p_{XZ}(x, z| \ \theta)} = \ln{p_{Z}(z| \ \theta)}$ and $p_{Z|X}(z| \ x, \theta^{(k)})=\ln{p_{Z}(z| \ \theta)}$)? Does an imputation of the missing values take place in the E-step, and if so, how are they imputed? Will they be imputed with the current value of the mean parameter?

Best Answer

Having studied this, I would remark that there is quite a lot going on with the EM algorithm if it's unfamiliar. Yours is an interesting question: some of the presentations I have seen of the EM algorithm don't explicitly specify how imputation of missing data is handled.

Having seen the presentations in some of the more well-known machine learning/statistics textbooks and tutorial papers, I would say that the clearest way to see how imputation of missing values occurs in the EM algorithm is to view how it works through expected sufficient statistics.

Missing data imputation using the EM algorithm.

You are entirely correct that the EM algorithm is for maximum-likelihood estimation in the presence of latent variables (which can be defined to be missing data), and that imputation/inference of these latent variables is a subroutine for parameter estimation.

And yes, you are correct that you use current estimates of the parameters $\theta^{k}$ (or an initial guess when $k=0$) to impute what the latent variables might be.

There are two senses in which the imputation of the missing data occurs:

  • In the E-step, you compute the conditional distribution of the missing data $Z$, given the observed variables $X = x$ and the current parameter estimates $\theta = \theta^{k}$, that is, $p_{Z | X}(Z \mid X = x, \theta = \theta^{k})$. This is imputation in a probabilistic sense: in Bayesian terms, you are computing a posterior distribution over the values that the missing data $Z$ can take, given the observed variables and current parameter estimates.

  • However, if you mean imputation in the sense of "filling in the missing data" with an appropriate point prediction, that occurs in both the second part of the E-step and the M-step.

  • In the E-step, when you come to compute the expected complete log-likelihood under the conditional distribution $p_{Z | X}(Z \mid X = x, \theta = \theta^{k})$, you will need to compute expected sufficient statistics, which requires not only filling in the missing values $Z$ with the mean of the conditional distribution/posterior $p_{Z | X}(Z \mid X = x, \theta = \theta^{k})$, but also using the variance of that posterior.

  • In the M-step, these expected sufficient statistics are used to update your parameters; a short code sketch of this overall loop is given just after this list.
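
To make this two-step structure concrete before getting to the formulas, here is a minimal sketch of the overall loop in Python/NumPy (my own illustration, not taken from any particular textbook). It assumes the data sit in an $N \times D$ array with `np.nan` marking the missing entries; the helpers `e_step` and `m_step` are hypothetical names that are spelled out later in this answer.

```python
import numpy as np

def em_missing_gaussian(Y, mu0, Sigma0, n_iter=100):
    """Sketch of EM for a D-variate Gaussian with missing entries.

    Y           : (N, D) array with np.nan marking missing values
    mu0, Sigma0 : initial guesses for the mean and covariance
    The helpers e_step and m_step are defined later in this answer.
    """
    mu = np.asarray(mu0, dtype=float).copy()
    Sigma = np.asarray(Sigma0, dtype=float).copy()
    for _ in range(n_iter):
        # E-step: posterior over each row's missing block, summarised by the
        # expected sufficient statistics E[y_n] and E[y_n y_n^T]
        Ey, Eyy = e_step(Y, mu, Sigma)
        # M-step: maximum-likelihood update of (mu, Sigma) from those expectations
        mu, Sigma = m_step(Ey, Eyy)
    return mu, Sigma
```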

If you are unfamiliar with the expected sufficient statistics viewpoint, I refer you to Section 11.6, "Fitting models with missing data" (pp. 374-376), of Machine Learning: A Probabilistic Perspective by Kevin Murphy, which shows how to handle missing-data EM in the multivariate Normal case rather than the bivariate case (the multivariate case seems to be the norm in most of the main ML textbooks).

Lastly, I would remark that your confusion may arise from a faulty partitioning of your data into missing and observed variables; the formulae you have specified in brackets are incorrect because of this faulty partitioning. Again, I refer you to Murphy, but here is the gist of what happens for multivariate Normal distributions. The exposition I've added is from my handwritten notes, as Murphy is not entirely clear, in my view, on how the partitioning works.

Our data matrix is, say, $\mathbf{Y} \in \mathbb{R}^{D \times N}$, where the $n$th observation of the $D$ input variables is a $D$-dimensional vector $y_n \in \mathbb{R}^D$. The case where $y_n$ is a random vector with a multivariate Normal distribution, $y_n \sim N(\mu, \Sigma)$ with $\mu \in \mathbb{R}^{D}$ and $\Sigma \in \mathbb{R}^{D \times D}$, is often used to show how this works analytically.

Each of the $y_n$ will have missing components, which we denote as $z_n$, and observed components, which we denote as $x_n$ (you can think of $x_n$ and $z_n$ as sub-vectors of $y_n$). We can therefore partition each $y_n$ into an observed sub-vector $x_n$ and a missing-data sub-vector $z_n$. Since $x_n$ and $z_n$ are jointly Normal, we can write:

$$y_n = \begin{bmatrix} x_n \\ z_n \end{bmatrix} \sim N \left( \begin{bmatrix} \mu_{n, x} \\ \mu_{n, z} \end{bmatrix}, \begin{bmatrix} \Sigma_{n, xx} & \Sigma_{n, xz} \\ \Sigma_{n, zx} & \Sigma_{n, zz} \end{bmatrix} \right)$$

where $\mu_{n, x}$ and $\mu_{n, z}$ are sub-vectors of $\mu$, and $\Sigma_{n, xx}, \Sigma_{n, xz}, \Sigma_{n, zx}, \Sigma_{n, zz}$ are sub-matrices of $\Sigma$, selected according to which dimensions of $y_n$ are observed ($x_n$) and which are missing ($z_n$).
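
To make the partitioning concrete, here is how it might look in code, continuing the NumPy sketch above (the helper name `partition_params` and the `np.nan` convention are my own, not Murphy's):

```python
def partition_params(y_n, mu, Sigma):
    """Split mu and Sigma according to the observed/missing pattern of y_n.

    y_n : (D,) vector with np.nan in the missing positions.
    Returns the observed/missing masks plus the sub-vectors mu_x, mu_z and
    the sub-matrices Sigma_xx, Sigma_xz, Sigma_zx, Sigma_zz.
    """
    obs = ~np.isnan(y_n)                  # observed components -> x_n
    mis = np.isnan(y_n)                   # missing components  -> z_n
    mu_x, mu_z = mu[obs], mu[mis]
    Sigma_xx = Sigma[np.ix_(obs, obs)]
    Sigma_xz = Sigma[np.ix_(obs, mis)]
    Sigma_zx = Sigma[np.ix_(mis, obs)]
    Sigma_zz = Sigma[np.ix_(mis, mis)]
    return obs, mis, mu_x, mu_z, Sigma_xx, Sigma_xz, Sigma_zx, Sigma_zz
```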

We can now use standard formulae for multivariate Normal conditional distributions to specify the posterior used in the E-step:

$$p(z_n \mid x_n, \mu, \Sigma) = N(z_n \mid m_n, V_n)$$

where the posterior mean and variance are:

$$m_n = \mu_{n, z} + \Sigma_{n, zx}\Sigma_{n, xx}^{-1}(x_n - \mu_{n, x})$$

$$V_n = \Sigma_{n, zz} - \Sigma_{n, zx}\Sigma_{n, xx}^{-1}\Sigma_{n, xz}$$
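
In code, these two formulas are only a couple of lines. Continuing the sketch above (and using `np.linalg.solve` rather than explicitly forming $\Sigma_{n, xx}^{-1}$):

```python
def conditional_moments(y_n, mu, Sigma):
    """Posterior mean m_n and covariance V_n of the missing block z_n given x_n."""
    obs, mis, mu_x, mu_z, S_xx, S_xz, S_zx, S_zz = partition_params(y_n, mu, Sigma)
    x_n = y_n[obs]
    # m_n = mu_z + Sigma_zx Sigma_xx^{-1} (x_n - mu_x)
    m_n = mu_z + S_zx @ np.linalg.solve(S_xx, x_n - mu_x)
    # V_n = Sigma_zz - Sigma_zx Sigma_xx^{-1} Sigma_xz
    V_n = S_zz - S_zx @ np.linalg.solve(S_xx, S_xz)
    return obs, mis, m_n, V_n
```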

We now use the posterior mean $m_n$ and variance $V_n$ to compute the expected sufficient statistics $\mathbb{E}[y_n]$ and $\mathbb{E}[y_n y_n^T]$ (expectations of the complete data vectors under this posterior). These are used to compute the $Q(\theta, \theta^{k})$ function in the E-step, and to compute the maximum-likelihood parameter updates $\mu^{k+1}$ and $\Sigma^{k+1}$ in the M-step.
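
Concretely, the E-step fills the missing positions of each $y_n$ with the posterior mean $m_n$ and adds the posterior covariance $V_n$ to the missing-missing block of the outer product, and the M-step is then the usual Gaussian maximum-likelihood estimate applied to these expectations. A sketch under the same assumptions as above:

```python
def e_step(Y, mu, Sigma):
    """Expected sufficient statistics E[y_n] and E[y_n y_n^T] for every row of Y."""
    N, D = Y.shape
    Ey = np.empty((N, D))
    Eyy = np.empty((N, D, D))
    for n in range(N):
        obs, mis, m_n, V_n = conditional_moments(Y[n], mu, Sigma)
        y_hat = Y[n].copy()
        y_hat[mis] = m_n                   # fill missing entries with the posterior mean
        Ey[n] = y_hat
        Eyy[n] = np.outer(y_hat, y_hat)
        Eyy[n][np.ix_(mis, mis)] += V_n    # posterior variance enters the missing block
    return Ey, Eyy

def m_step(Ey, Eyy):
    """Maximum-likelihood updates mu^{k+1}, Sigma^{k+1} from the expected statistics."""
    mu_new = Ey.mean(axis=0)
    Sigma_new = Eyy.mean(axis=0) - np.outer(mu_new, mu_new)
    return mu_new, Sigma_new
```

With these in place, the `em_missing_gaussian` loop sketched earlier simply alternates the two steps until $\mu$ and $\Sigma$ stop changing appreciably.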

Further references.

Whilst the expected sufficient statistics viewpoint is, in my opinion, excellent for seeing how data imputation works in the EM algorithm, it is not as well suited to understanding why we use the expected complete log-likelihood in the first place, nor why we can iteratively maximise it as a surrogate for the observed-data/incomplete log-likelihood. The clearest exposition I have seen on this aspect of the EM algorithm is in Chapter 11 of an unpublished book called "Probabilistic Graphical Models" by the legendary Michael Jordan, which you can find online.
