Solved – Covariance matrix estimation in presence of missing data

covariance-matrix, missing data, r

I want to estimate a covariance matrix from data with some missing values. Ideally I'd like an R package, but Python could be OK.

R has some built in ways of doing this. You can use

cov.mat <- cov(X, use = "pairwise.complete.obs")

Or the same using cor (correlation). The trouble is that with cov the resulting matrix is not guaranteed to be positive definite. If you do cov2cor(cov.mat), you will find correlation coefficients outside of [-1,1]. Using pairwise with cor seems to handle this, so I could then use the diagonal variances to go from cor.mat to cov.mat. Still, this is probably not optimal.
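Since Python is acceptable here, the pairwise idea and the cor-then-rescale workaround can be sketched with NumPy. This is only an illustration under assumptions: the function names pairwise_cov_cor and cov_from_cor are made up for this example, missing values are assumed to be coded as NaN, and every pair of columns is assumed to share at least two rows. It is not a reimplementation of R's cov/cor.

```python
import numpy as np

def pairwise_cov_cor(X):
    """Pairwise-complete covariance and correlation of the columns of X.

    Hypothetical helper: NaN marks a missing value, and every pair of
    columns is assumed to have at least two rows in common.
    """
    X = np.asarray(X, dtype=float)
    M = (~np.isnan(X)).astype(float)    # 1.0 where a value is observed
    X0 = np.where(np.isnan(X), 0.0, X)  # missing entries contribute zero to sums
    n = M.T @ M                         # n[i, j]: rows where both i and j are observed
    S = X0.T @ M                        # S[i, j]: sum of column i over those rows
    Sxy = X0.T @ X0                     # sum of products x_i * x_j over those rows
    Sxx = (X0 ** 2).T @ M               # sum of squares of column i over those rows
    cov = (Sxy - S * S.T / n) / (n - 1)
    var = (Sxx - S ** 2 / n) / (n - 1)  # var[i, j]: variance of i on rows shared with j
    cor = cov / np.sqrt(var * var.T)    # each entry uses one common set of rows
    return cov, cor

def cov_from_cor(cov, cor):
    """Rescale a correlation matrix by the per-column standard deviations."""
    sd = np.sqrt(np.diag(cov))          # diagonal uses all observed values per column
    return cor * np.outer(sd, sd)
```

Each pairwise correlation is computed from a single set of rows, so it stays within [-1,1], whereas dividing a pairwise covariance by standard deviations taken from different rows can leave that range. Neither matrix is guaranteed to be positive definite.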

There appear to be a few packages that claim to do this (mvnmle, rsem), but neither appears to work. rsem fails to run for me. mvnmle can only handle up to 50 variables. I need to handle roughly 1500 variables, and I would like it to run in a few seconds.

Anyone know of a good package for this?

Best Answer

Here are some of the approaches that come to mind.

You can use the nearPD function from the Matrix package to convert the matrix returned by cov/cor into the nearest positive definite matrix.
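To show the idea behind this repair step in Python, here is a minimal sketch that clips negative eigenvalues. This is a cruder stand-in, not what nearPD does: nearPD uses a more careful iterative algorithm and can keep the diagonal fixed, while plain clipping changes the diagonal slightly. The name clip_to_pd and the floor eps are assumptions made for this example.

```python
import numpy as np

def clip_to_pd(A, eps=1e-8):
    """Make a symmetric matrix positive definite by flooring its eigenvalues.

    Hypothetical helper; eps is an arbitrary small floor, not a tuned value.
    """
    A = np.asarray(A, dtype=float)
    A = (A + A.T) / 2            # enforce exact symmetry
    w, V = np.linalg.eigh(A)     # eigen-decomposition of the symmetric matrix
    w = np.maximum(w, eps)       # raise negative or zero eigenvalues to eps
    return (V * w) @ V.T         # rebuild V diag(w) V^T
```

If the input is a correlation matrix, the result can be rescaled back to a unit diagonal afterwards, at the cost of no longer being the closest matrix in any strict sense.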

You can use an imputation package such as VIM to fill in the missing data and then call cov on the completed data.
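The impute-then-cov route looks roughly like this in Python. Column-mean imputation is used here only because it fits in a few lines; it is a much simpler stand-in for the methods an imputation package offers, and it shrinks variances and covariances toward zero. The name mean_impute_cov is made up for this sketch.

```python
import numpy as np

def mean_impute_cov(X):
    """Fill NaNs with column means, then take the ordinary sample covariance.

    Hypothetical helper; assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)               # mean of observed values per column
    X_filled = np.where(np.isnan(X), col_means, X)  # replace each NaN by its column mean
    return np.cov(X_filled, rowvar=False)           # columns are the variables
```

Because the covariance is computed from a complete data matrix, the result is positive semidefinite by construction, whichever imputation method fills the gaps.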

You can code up your own routine to compute the covariance estimate using Expectation Maximization (EM) under a multivariate normal assumption. For details, you can refer to Rubin.
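A hand-rolled EM routine along these lines might look like the sketch below. It assumes multivariate normal data with NaN marking missing values, and the names em_mvn_cov, n_iter and tol are invented for this example. It loops over rows, so it is meant to show the E and M steps, not to meet the few-seconds target for 1500 variables; grouping rows by missingness pattern would be the obvious speed-up.

```python
import numpy as np

def em_mvn_cov(X, n_iter=100, tol=1e-6):
    """EM estimate of mean and covariance under multivariate normality.

    Hypothetical helper: NaN marks a missing value. Returns the maximum
    likelihood style estimate (divisor n, not n - 1).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)                 # start from observed-data means
    sigma = np.diag(np.nanvar(X, axis=0))      # and a diagonal covariance
    for _ in range(n_iter):
        X_hat = np.where(miss, mu, X)          # missing cells get overwritten below
        C = np.zeros((p, p))                   # accumulated conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue                       # complete row: nothing to fill in
            o = ~m
            if not o.any():
                C += sigma                     # fully missing row: conditional = marginal
                continue
            # E-step: regress the missing block on the observed block
            B = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)]).T
            X_hat[i, m] = mu[m] + B @ (X[i, o] - mu[o])
            C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
        # M-step: moments of the completed data plus the conditional covariance
        mu_new = X_hat.mean(axis=0)
        D = X_hat - mu_new
        sigma_new = (D.T @ D + C) / n
        converged = np.max(np.abs(sigma_new - sigma)) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma
```

The estimate is positive semidefinite by construction, since it is a sum of an outer-product term and conditional covariance terms.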