Maximum Likelihood – Covariance Estimation from Expectation-Maximization with Missing Data

Tags: covariance, data-imputation, expectation-maximization, maximum-likelihood, missing-data

Assuming a multivariate normal distribution with missing data, is there a straightforward way to find the maximum likelihood estimate for covariance using an Expectation-Maximization algorithm?

NOTE: This question is almost solved. The only problem that remains is allowing for more than one missing variable per observation (row).

For example, consider bivariate data given in a $25\times 2$ matrix $\mathbf{Y}$. Five observations are randomly missing from the first column, and we wish to find the ML estimates $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ using Expectation-Maximization. The log-likelihood function is given by
\begin{equation}
\log L(\boldsymbol{\theta})=-\tfrac{1}{2}\left(n\log(2\pi)+\log|\mathbf{W}|+(\mathbf{y}-\mathbf{X}\boldsymbol{\mu})'\mathbf{W}^{-1}(\mathbf{y}-\mathbf{X}\boldsymbol{\mu})\right)
\end{equation}
where $n$ is the number of observed values, $\mathbf{W}=\boldsymbol{\Sigma}\otimes \mathbf{I}$ with the rows and columns corresponding to missing values removed, $\mathbf{y}$ is the $n$-length vectorized form of the observed values of $\mathbf{Y}$ (NA values are omitted), and $\mathbf{X}$ is the $n\times 2$ design matrix of zeros and ones indicating whether element $i$ of $\mathbf{y}$ comes from column $1$ or column $2$ of $\mathbf{Y}$.
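To make the notation concrete, here is a small illustration of my own (not part of the question's code) that builds $\mathbf{y}$, $\mathbf{X}$, and $\mathbf{W}$ for a $3\times 2$ matrix $\mathbf{Y}$ with $Y_{2,1}$ missing:

Y <- matrix(c(1,NA,3,4,5,6),ncol=2)   # Y[2,1] is missing
Sigma <- matrix(c(2,1,1,3),2,2)       # some 2x2 covariance matrix
y <- na.exclude(as.double(Y))         # the n = 5 observed values, stacked by column
X <- (diag(2) %x% matrix(1,3,1))[-2,] # 5 x 2 design matrix (NA row dropped)
W <- (Sigma %x% diag(3))[-2,-2]       # 5 x 5 covariance of y (NA row/column dropped)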

If I understand correctly, $\hat{\boldsymbol{\mu}}$ can be found by an Expectation-Maximization scheme in which imputations for the missing values of $\mathbf{Y}$ are iteratively updated with predictions from a linear regression. At convergence, the imputed values equal the ML imputed estimates. However, the covariance of the augmented matrix $\mathbf{Y}_{aug}$ is downwardly biased by the imputation. This is demonstrated in the reproducible R example below, in which the ML covariance is estimated numerically with the optim function and compared to the ML covariance of $\mathbf{Y}_{aug}$ obtained from EM. The estimate of $\Sigma_{1,1}$ is too low relative to the ML estimate, whereas the estimates of $\Sigma_{2,1}=\Sigma_{1,2}$ and $\Sigma_{2,2}$ are approximately identical to the ML estimates. Is there a way to correct for the downward bias while avoiding numerical optimization?

That is, my goal is to obtain the ML estimate of $\boldsymbol{\Sigma}$ without numerical optimization, using either a closed-form solution or the EM algorithm itself.

# profile log-likelihood of the observed data: for a given sigma,
# mu is replaced by its GLS estimate inside the function
LL <- function(sigma,Y)
{
  W <- sigma%x%diag(nrow(Y)) # covariance of vec(Y)
  isna <- which(is.na(Y)) # positions of NAs in vec(Y)
  if(length(isna)>0) W <- W[-isna,-isna] # remove NA rows/columns from W
  X <- matrix(0,nrow(Y)*ncol(Y),ncol(Y))
  for(i in 1:ncol(Y)) X[1:nrow(Y)+(i-1)*nrow(Y),i] <- 1 # design matrix
  if(length(isna)>0) X <- X[-isna,] # remove NA rows from X

  Winv <- solve(W) # invert once and reuse
  y <- na.exclude(as.double(Y)) # observed values, stacked column by column

  # GLS estimate of the mean vector
  mu <- t(solve(t(X)%*%Winv%*%X)%*%t(X)%*%Winv%*%y)

  # residuals of the observed values around their column means
  r <- na.exclude(as.double(Y-matrix(1,nrow(Y))%*%mu))

  # log-likelihood of the observed data
  -.5*(length(y)*log(2*pi) + determinant(W)$modulus[[1]] + r%*%Winv%*%r)
}

# build 2x2 positive definite covariance matrix
buildmat <- function(x)
{
  mat <- matrix(0,2,2)
  mat[lower.tri(mat,diag=TRUE)] <- x
  tcrossprod(mat)
}
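As a quick check of my own, buildmat inverts the Cholesky-style parameterization used in the optim call further down: it maps the three free elements of a lower-triangular factor back to the positive definite matrix.

L <- t(chol(matrix(c(2,1,1,3),2,2)))  # lower-triangular Cholesky factor
buildmat(L[lower.tri(L,diag=TRUE)])   # recovers the original matrix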

set.seed(4)
nmissing <- 5 # number of missing observations
dat <- matrix(rnorm(50),ncol=2) # original data
dat_missing <- dat
dat_missing[sample(nrow(dat),nmissing),1] <- NA # missing data in column 1

dat_impute <- dat_missing # data matrix for imputation
missing_ind <- which(is.na(dat_missing[,1])) # index of missing values
dat_impute[missing_ind,1] <- mean(na.exclude(dat_missing[,1]))

# EM imputation: iteratively replace the missing values in column 1 with
# predictions from the regression of column 1 on column 2
repeat
{
  pred <- predict(lm(dat_impute[,1]~dat_impute[,2]))[missing_ind]
  if(all(abs(dat_impute[missing_ind,1]-pred)<=1e-32)) break
  dat_impute[missing_ind,1] <- pred
}

# compare covariance of augmented data to ML estimate
cov(dat_impute)*(nrow(dat)-1)/nrow(dat)
# numerical ML estimate of Sigma, parameterized through its Cholesky factor
# (via buildmat) so that Sigma stays positive definite; starting values
# come from the complete cases of the observed data
fit <- optim(chol(cov(na.exclude(dat_missing)))[c(1,3,4)],
             function(par) LL(buildmat(par),dat_missing),
             control=list(fnscale=-1))
buildmat(fit$par)
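As a sanity check (my own addition, using the fit object from above): the profile log-likelihood at the numerically optimized covariance should be at least as high as at the downwardly biased covariance of the imputed data.

LL(buildmat(fit$par),dat_missing)                        # at the numerical ML estimate
LL(cov(dat_impute)*(nrow(dat)-1)/nrow(dat),dat_missing)  # at the biased EM covariance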

EDIT: Apparently there is a closed-form solution for bivariate data where only one variable has missing data. I am seeking a more general solution for any multivariate data in which all variables may have missing data. I know it should be possible to obtain the corrected ML covariance, as the R package norm can do this (continuing from the above R example):

require(norm)
s <- prelim.norm(dat_missing)
e <- em.norm(s)
getparam.norm(s = s,theta = e)

but unfortunately the underlying FORTRAN routines are less than intuitive.

EDIT 2: Here's a very nice tutorial for bivariate data. Any thoughts on generalizing it to the multivariate case? A bivariate sketch is given below.
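For reference, in the bivariate case the update can be sketched in a few lines (my own sketch, continuing from the first R example above, where only column 1 has NAs): impute with the conditional mean of $y_1$ given $y_2$, then add back the conditional variance that plugging in means removes.

Z <- dat_missing
mu <- colMeans(Z,na.rm=TRUE)
S <- cov(Z,use="complete.obs")
for(it in 1:100)
{
  b <- S[1,2]/S[2,2]                       # slope of y1 regressed on y2
  cvar <- S[1,1] - b*S[1,2]                # conditional variance of y1 given y2
  Z[missing_ind,1] <- mu[1] + b*(Z[missing_ind,2]-mu[2])
  mu <- colMeans(Z)
  S <- cov(Z)*(nrow(Z)-1)/nrow(Z)          # ML covariance of the imputed data
  S[1,1] <- S[1,1] + nmissing*cvar/nrow(Z) # bias correction for column 1
}
mu
S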

UPDATED: Here's my latest attempt. However, something seems to be wrong with my implementation, as the results don't match the norm output.

FURTHER UPDATED: The code now works for multivariate data, but only if there is at most one missing value per observation (one NA per row). There seems to be a problem with weighting the covariances when more than one variable is missing for a given observation.

The implementation is based on the following paper (via Randel):

Beale, E. M. L., & Little, R. J. A. (1975). Missing Values in Multivariate Analysis. Journal of the Royal Statistical Society, Series B (Methodological), 37(1), 129–145.

set.seed(4)
dat <- matrix(rnorm(100),ncol=5) # original data
nmissing <- 10

dat_missing <- dat
dat_missing[sample(length(dat_missing),nmissing)] <- NA
is_na <- apply(dat_missing,2,is.na) # index of NAs

dat_impute <- dat_missing # data matrix for imputation

# set initial estimates to means from available data
for(i in 1:ncol(dat_impute)) dat_impute[is_na[,i],i] <- colMeans(dat_missing,na.rm = TRUE)[i]

new_dat_impute <- dat_impute

# starting values for EM
means <- colMeans(dat_impute)
# NOTE: multiplying by (nrow-1)/(nrow) to get ML estimate
# For comparability with norm package output
sigma <- cov(dat_impute)*(nrow(dat_impute)-1)/nrow(dat_impute)

# matrix of regression coefficients: column i holds the coefficients
# for regressing variable i on the remaining variables
betas <- matrix(0,ncol(sigma)-1,ncol(sigma))

# carry out EM over 1000 iterations
for(j in 1:1000)
{
  # get updated means and covariance matrix
  means <- colMeans(dat_impute)
  new_sigma <- cov(dat_impute)*(nrow(dat_impute)-1)/nrow(dat_impute)

  # correct for bias in covariance matrix
  for(i in 1:ncol(sigma))
  {
    mis <- crossprod(is_na) # mis[i,ii]: number of rows where variables i and ii are both missing
    for(ii in 1:ncol(sigma))
    {
      if(i!=ii)
      {
        # partial correlation squared
        ##### This is where the problem is
        R2 <- (solve(sigma)[i,ii] / sqrt(solve(sigma)[i,i]*solve(sigma)[ii,ii]))^2
        new_sigma[i,ii] <- new_sigma[i,ii] + sigma[i,ii]*(1-R2)*(mis[i,ii])/nrow(dat_impute)
      } else
      {
        R2 <- t(sigma[-i,i]) %*% (solve(sigma[-i,-i]) %*% sigma[-i,i]) / sigma[i,i]
        new_sigma[i,ii] <- new_sigma[i,ii] + sigma[i,ii]*(1-R2)*mis[i,ii]/nrow(dat_impute)
      }
    }
  }
  sigma <- new_sigma

  # get estimates of beta and intercept if any missing data in column i
  for(i in 1:ncol(sigma))
  {
    if(any(is_na[,i]))
    {
      betas[,i] <- solve(sigma[-i,-i]) %*% sigma[-i,i]
      intercept <- means[i] - means[-i] %*% betas[,i]
      dat_impute[which(is_na[,i]),i] <- (cbind(1,dat_impute[,-i]) %*% c(intercept,betas[,i]))[which(is_na[,i]),1]
    }
  }  
}

# compare results to norm package output
s <- prelim.norm(dat_missing)
e <- em.norm(s,criterion=1e-32,showits = FALSE)

# compare means
max(abs(getparam.norm(s,e)[[1]] - means))
# compare covariance matrix
max(abs(getparam.norm(s,e)[[2]] - new_sigma))

Best Answer

It turns out the algorithm is rather simple. Starting from initial estimates of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$:

  1. Create a bias matrix $\mathbf{B}$ of dimension $nvar\times nvar$, initialized with zeros.
  2. For each row (observation) with missing data, denote the available indices by $a$ and the missing indices by $m$. Given the current estimates of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, impute the missing values as $\hat{\mathbf{y}}_m=\boldsymbol{\mu}_m+\boldsymbol{\Sigma}_{m,a}\boldsymbol{\Sigma}_{a,a}^{-1}(\mathbf{y}_a-\boldsymbol{\mu}_a)$, and update $\mathbf{B}_{m,m}\leftarrow\mathbf{B}_{m,m}+\boldsymbol{\Sigma}_{m,m}-\boldsymbol{\Sigma}_{m,a}\boldsymbol{\Sigma}_{a,a}^{-1}\boldsymbol{\Sigma}_{a,m}$.
  3. Calculate $\boldsymbol{\mu}^{(i+1)}$ and $\boldsymbol{\Sigma}_{biased}$ from the newly imputed data, then adjust for the bias: $\boldsymbol{\Sigma}^{(i+1)}=\boldsymbol{\Sigma}_{biased}+\mathbf{B}n^{-1}$.

Repeat steps 1–3 until convergence. For restricted maximum likelihood, replace $n^{-1}$ with $(n-1)^{-1}$ in the covariance calculations.

require(norm)
dat <- matrix(rnorm(1000),ncol=5) # original data
nvar <- ncol(dat)
n <- nrow(dat)
nmissing <- 50

dat_missing <- dat
dat_missing[sample(length(dat_missing),nmissing)] <- NA
is_na <- apply(dat_missing,2,is.na) # index of NAs

dat_impute <- dat_missing # data matrix for imputation

# set initial estimates to means from available data
for(i in 1:ncol(dat_impute)) dat_impute[is_na[,i],i] <- colMeans(dat_missing,na.rm = TRUE)[i]

# starting values for EM
means <- colMeans(dat_impute)
# NOTE: multiplying by (nrow-1)/(nrow) to get ML estimate
# For comparability with norm package output
sigma <- cov(dat_impute)*(nrow(dat_impute)-1)/nrow(dat_impute)

# get estimates from norm package for comparison
s <- prelim.norm(dat_missing)
e <- em.norm(s,criterion=1e-32,showits = FALSE)

# carry out EM over 100 iterations
for(j in 1:100)
{
  bias <- matrix(0,nvar,nvar)
  for(i in 1:n)
  {
    row_dat <- dat_missing[i,]
    avail <- which(!is.na(row_dat))
    if(length(avail)<nvar)
    {
      Saa_inv <- solve(sigma[avail,avail]) # invert the available block once per row
      # accumulate the conditional covariance of the missing block (step 2)
      bias[-avail,-avail] <- bias[-avail,-avail] + sigma[-avail,-avail] -
        sigma[-avail,avail] %*% Saa_inv %*% sigma[avail,-avail]
      # impute the missing values by their conditional means
      dat_impute[i,-avail] <- means[-avail] +
        (sigma[-avail,avail] %*% Saa_inv) %*% (row_dat[avail]-means[avail])
    }
  }

  # get updated means and covariance matrix
  means <- colMeans(dat_impute)
  biased_sigma <- cov(dat_impute)*(n-1)/n

  # correct for bias in covariance matrix
  sigma <- biased_sigma + bias/n
}


# compare results to norm package output
# compare means
max(abs(getparam.norm(s,e)[[1]] - means))
# compare covariance matrix
max(abs(getparam.norm(s,e)[[2]] - sigma))
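As a cross-check of my own (reusing the LL function defined in the question above), the observed-data profile log-likelihood at the EM solution should agree with its value at the norm estimate up to convergence error:

LL(sigma,dat_missing)
LL(getparam.norm(s,e)[[2]],dat_missing)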