Solved – Covariance matrix for missing data

Tags: covariance, covariance-matrix, estimation, missing data, svd

I am trying to understand the mathematics behind estimating the covariance matrix for a set of observations with missing data entries (or NaN).

I would like to do this without deleting rows that have missing entries and without using post-hoc smoothing to force the covariance matrix to be positive semi-definite. How might I do this?

I know that one method would be imputation (Missing data and covariate analysis), but what other methods are there? Thanks a lot for any insight!

Best Answer

Another approach is to compute the maximum likelihood estimates of the mean and covariance matrix, given all observed data. There is no closed-form solution in general, so this requires an iterative algorithm such as expectation maximization (EM); accelerated variants and other types of optimization algorithms exist too. Compared to imputation, this approach can produce estimators that are more efficient, and unbiased under a wider variety of settings. It does require that the missingness of the data does not depend on the missing values themselves (i.e. the data are 'missing completely at random' or 'missing at random').
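To make the EM idea concrete, here is a minimal sketch in Python/NumPy. It is not taken from the references; it assumes the rows are i.i.d. multivariate normal, that missing entries are coded as NaN, and that every column has at least one observed value. The function name `em_mean_cov` and its parameters are made up for illustration. In the E-step each row's missing block is replaced by its conditional mean given the observed block, and the conditional covariance of that block is added to the second-moment sum; the M-step then recomputes the mean and covariance from those expected sufficient statistics.

```python
import numpy as np

def em_mean_cov(X, n_iter=200, tol=1e-8):
    """EM sketch: ML mean/covariance of multivariate-normal rows containing NaNs."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    # Starting values: per-column available-case means and variances.
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        sum_x = np.zeros(p)
        sum_xx = np.zeros((p, p))
        for i in range(n):
            m, o = miss[i], ~miss[i]
            x = X[i].copy()
            c = np.zeros((p, p))  # conditional covariance of the missing block
            if m.any():
                if o.any():
                    # B = Sigma_mo Sigma_oo^{-1}: regression of missing on observed.
                    B = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)]).T
                    x[m] = mu[m] + B @ (x[o] - mu[o])
                    c[np.ix_(m, m)] = sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
                else:
                    # Row is entirely missing: fall back to the current marginal.
                    x[m] = mu[m]
                    c[np.ix_(m, m)] = sigma[np.ix_(m, m)]
            sum_x += x
            sum_xx += np.outer(x, x) + c
        # M-step: update from the expected sufficient statistics.
        mu_new = sum_x / n
        sigma_new = sum_xx / n - np.outer(mu_new, mu_new)
        sigma_new = (sigma_new + sigma_new.T) / 2  # remove round-off asymmetry
        delta = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if delta < tol:
            break
    return mu, sigma

# Usage on synthetic data with about 20% of entries removed completely at random.
rng = np.random.default_rng(0)
true_cov = np.array([[2.0, 0.8, 0.3], [0.8, 1.0, 0.4], [0.3, 0.4, 1.5]])
X = rng.multivariate_normal([1.0, -2.0, 0.5], true_cov, size=300)
X[rng.random(X.shape) < 0.2] = np.nan
mu_hat, sigma_hat = em_mean_cov(X)
print(mu_hat)
print(sigma_hat)
```

Because each update is an average of outer products plus positive semi-definite conditional covariances, the estimate stays positive semi-definite (up to round-off) without any post-hoc smoothing, which is what the question asks for. With no missing entries the result reduces to the usual ML covariance (divisor n rather than n-1). The per-row loop is written for readability; grouping rows by missingness pattern would be faster.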

References:

Jamshidian and Bentler (1999). ML Estimation of Mean and Covariance Structures with Missing Data Using Complete Data Routines.

Little and Rubin (1987). Statistical Analysis with Missing Data. [Particularly chapter 8]
