Solved – How to decide what to do with missing data when doing data analysis

MATLAB, missing data

I'm doing data analysis in MATLAB on a set of files that have missing data. Some gaps are isolated values, but in places whole days or even months of data are missing, for example because sensor devices were being repaired or replaced.

My question is: how do I decide what to do with missing data? When is it best to set missing values to zero, and when should I use interpolation or a similar method?

My professor advised me to simply set the missing values to zero. Should I proceed this way?

Edit: My data are weather measurements: nitrogen oxide, nitrogen dioxide, temperature, wind, amount of rain, etc. My task is to forecast upcoming changes in the weather conditions. Here is a sample of my data values (-99 marks an error value):

5 3 0 6 2 0 0 -99 17 24 94 74 -99 -99 -99 -99 10 5 3 5 5 17 1 3 3 0 6 2 0 0 0 -99
13 25 50 35 19 8 7 3 3 4 4 6 6 7 2 3 0 1 0 3 0 0 0 -99 5 7 28 27 16 9 7 7 7 9 9 11 
6 12

This is just a small sample; I have a lot of this data.

Best Answer

Setting missing data = 0 cannot be right, unless there is something very unusual about your situation that you are not telling us.
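To see why zero is a poor stand-in, here is a minimal Python sketch using the first row of the sample from the question. It compares the mean with the -99 codes left in, with them set to 0, and with them excluded. The variable names are illustrative, and the only assumption is that -99 always marks an error and never a real reading.

```python
# First row of the sample from the question; -99 is the sensor's error code.
SENTINEL = -99
row = [5, 3, 0, 6, 2, 0, 0, -99, 17, 24, 94, 74, -99, -99, -99, -99,
       10, 5, 3, 5, 5, 17, 1, 3, 3, 0, 6, 2, 0, 0, 0, -99]

# Treat the error code as "no observation" rather than as a number.
observed = [v for v in row if v != SENTINEL]
zero_filled = [0 if v == SENTINEL else v for v in row]

mean_raw = sum(row) / len(row)                    # error codes left in place
mean_zero = sum(zero_filled) / len(zero_filled)   # missing values set to 0
mean_observed = sum(observed) / len(observed)     # missing values excluded

print(f"error codes kept:  {mean_raw:.2f}")
print(f"missing set to 0:  {mean_zero:.2f}")
print(f"missing excluded:  {mean_observed:.2f}")
```

Both shortcuts pull the mean downward. The row also contains genuine zero readings, so after zero-filling a real 0 can no longer be told apart from a failed measurement. In MATLAB, `standardizeMissing(A, -99)` performs the same recoding to NaN, after which `mean(A, 'omitnan')` skips the gaps.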

The right thing to do depends on 1) Why the data are missing. Are they missing completely at random, missing at random, or not missing at random? 2) How much data are missing. 3) What analysis you plan to do.
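The second point, how much is missing, can be measured directly. The Python sketch below reports the fraction of missing values and the length of each consecutive gap, which helps separate isolated dropouts from long sensor outages. `missing_summary` is a hypothetical helper written for this illustration; it assumes a non-empty sequence and -99 as the error code.

```python
from itertools import groupby

SENTINEL = -99  # error code used in the question's data


def missing_summary(values, sentinel=SENTINEL):
    """Return (fraction missing, lengths of consecutive missing runs)."""
    flags = [v == sentinel for v in values]
    fraction = sum(flags) / len(flags)
    # groupby collapses neighbouring equal flags; keep only the missing runs.
    runs = [sum(1 for _ in group)
            for is_missing, group in groupby(flags) if is_missing]
    return fraction, runs


row = [5, 3, 0, 6, 2, 0, 0, -99, 17, 24, 94, 74, -99, -99, -99, -99,
       10, 5, 3, 5, 5, 17, 1, 3, 3, 0, 6, 2, 0, 0, 0, -99]

fraction, runs = missing_summary(row)
print(f"missing: {fraction:.1%}, gap lengths: {runs}")
```

Short gaps are plausible candidates for interpolation; a gap spanning days or months generally is not, because nothing nearby constrains the values inside it.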

MCAR (missing completely at random) means that the data are missing for reasons completely unrelated to the data themselves. MAR (missing at random) means that the reasons the data are missing are captured by data that you have. NMAR (not missing at random) means neither of the above is true; in that situation, you have a complex problem.
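The practical difference between these mechanisms shows up in a small simulation. The Python sketch below uses synthetic uniform readings (not the weather data from the question) and two invented loss rules: one where every reading has the same chance of being lost (MCAR) and one where the sensor fails whenever the reading is high (NMAR). The 30% loss rate and the threshold of 80 are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic readings between 0 and 100; purely illustrative.
full = [rng.uniform(0, 100) for _ in range(10_000)]


def mean(xs):
    return sum(xs) / len(xs)


# MCAR: every reading has the same 30% chance of being lost.
mcar = [x for x in full if rng.random() >= 0.3]

# NMAR: the reading is lost exactly when its own value is high.
nmar = [x for x in full if x <= 80]

print(f"full data mean:     {mean(full):.2f}")
print(f"MCAR observed mean: {mean(mcar):.2f}")
print(f"NMAR observed mean: {mean(nmar):.2f}")
```

Under MCAR the observed mean stays close to the full-data mean, so simple methods lose precision but not accuracy. Under NMAR the observed values are systematically too low, and no amount of averaging over what remains corrects that.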

If very little data are missing, you can use case deletion or mean substitution. If more data are missing and they are MCAR or MAR, one good method is multiple imputation.
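The two simple options can be sketched in a few lines of Python, again on the first row of the question's sample and assuming -99 is the only error code. Case deletion drops the missing entries; mean substitution replaces each one with the mean of the observed values. The variable names are illustrative.

```python
import statistics

SENTINEL = -99  # error code used in the question's data
row = [5, 3, 0, 6, 2, 0, 0, -99, 17, 24, 94, 74, -99, -99, -99, -99,
       10, 5, 3, 5, 5, 17, 1, 3, 3, 0, 6, 2, 0, 0, 0, -99]

# Case deletion: keep only the values that were actually observed.
observed = [v for v in row if v != SENTINEL]

# Mean substitution: fill each gap with the mean of the observed values.
fill_value = statistics.mean(observed)
substituted = [fill_value if v == SENTINEL else v for v in row]

print(f"case deletion:     n={len(observed)}, "
      f"mean={statistics.mean(observed):.2f}, "
      f"variance={statistics.pvariance(observed):.1f}")
print(f"mean substitution: n={len(substituted)}, "
      f"mean={statistics.mean(substituted):.2f}, "
      f"variance={statistics.pvariance(substituted):.1f}")
```

Mean substitution leaves the mean unchanged but shrinks the variance, because every filled value sits exactly at the centre. For a time series, case deletion also breaks the regular spacing of the observations. Both effects are tolerable only when very few values are missing.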
