Solved – Pearson correlation with missing values

correlation, interpolation, missing data, pearson-r, regression

I am trying to correlate dendrochronological data with climate data. The former is measured directly from trees; the latter comes from various stations around the world. According to the formula for the Pearson correlation, the two sets of values must be the same size. But the climate data is not always complete – e.g. the temperature might not have been recorded on a given day 100 years ago.

What should I do in such a situation?

I had two ideas: interpolate the missing values, or omit the incomplete pairs. I don't want to do the first, as it artificially creates values which might not be true. But can I do the second?

I am not a mathematician and I'm not sure whether it is a viable option. Also, if you have any sources to back up your answers, I'd appreciate them as well.

Best Answer

Imputation (what you are calling interpolation) is widely used to handle missing data. You will obtain good estimates of the Pearson correlation using (flexible) mean imputation. However, to estimate standard errors you will have to use multiple imputation. Omitting incomplete pairs is called complete case analysis; it is inefficient because it discards data, but it can work decently well. For any of these approaches to be valid, one must assume that the missingness doesn't depend on unmeasured values, like the outcome itself.
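
To make the complete case option concrete, here is a minimal sketch in Python with NumPy. It assumes missing observations are stored as NaN and that the two series are already aligned by date. The function name `pearson_complete_cases` and the sample numbers are made up for illustration; they are not from the question.

```python
import numpy as np

def pearson_complete_cases(x, y):
    """Pearson r computed only from pairs where both values are present.

    Returns (r, n), where n is the number of complete pairs actually used.
    Assumes missing values are encoded as NaN.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")

    # Keep a pair only if neither member is missing.
    keep = ~(np.isnan(x) | np.isnan(y))
    n = int(keep.sum())
    if n < 3:
        raise ValueError("need at least 3 complete pairs")

    r = np.corrcoef(x[keep], y[keep])[0, 1]
    return r, n

# Hypothetical, made-up values purely for illustration.
ring_width = np.array([1.2, 0.9, 1.5, 1.1, 1.8, 1.3])
temperature = np.array([14.1, np.nan, 15.0, 13.8, np.nan, 14.6])

r, n = pearson_complete_cases(ring_width, temperature)
print(f"r = {r:.3f} from {n} complete pairs")
```

Reporting the number of pairs alongside r shows how much data was discarded. If the series are held in pandas, `Series.corr` likewise excludes pairs with missing values.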