Solved – Handling large gaps of missing data in the dataset

data transformation, data-imputation, dataset, missing data, multiple-imputation

Suppose that I have a dataset of daily meteorological measurements from one active weather station over a span of 16 years.

The structure of the data is as follows: the columns are the different weather and climate variables, and the rows are the days in the record (think 16 * 365 days).

It just so happens that there are missing values in this dataset, most of them near the beginning of the period (2000–2016). Closer inspection reveals that the missing data often come in spans of time rather than individual days (e.g. a week, a month, or a few months of consecutive missing days).

My professor suggested that I just manually replace missing days with data from the preceding day, but that seems problematic to me: replacing a week's worth of missing values with values from the preceding days sounds dubious.
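For concreteness, here is a minimal sketch of what that suggestion amounts to in pandas (the series and its values are made up for illustration): a forward fill carries the last observed value across the entire gap, which is exactly why long gaps are a problem.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature series with a week-long gap.
dates = pd.date_range("2000-01-01", periods=14, freq="D")
temp = pd.Series(np.arange(14.0), index=dates)
temp.iloc[4:11] = np.nan  # a 7-day gap

# The suggested approach: carry the last observed value forward.
filled = temp.ffill()

# Every day in the gap now holds the value from the last day
# before the gap, so a week of data is replaced by one constant.
print(filled.iloc[4:11].nunique())  # prints 1
```

For a one-day gap this is fairly harmless; for a month-long gap it flattens out all daily variation and any seasonal trend within the gap.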

I have read about techniques such as imputation and about missing-data mechanisms, but I'm not sure whether they would suffice.

From my research, the missing data do not appear to be MNAR (missing not at random), though I may be wrong.

My real question is: Is manual replacement of values even worth it with such large gaps in the dataset?

Thanks to everyone who replies.

Best Answer

You probably need to do something about it, because using the data without dealing with the missingness makes extremely strong assumptions. Depending on the analysis method, these might be that the data are missing completely at random, or that the missing days equalled the average of the whole year, or the value of the last non-missing day — all of which are most likely questionable.

It would help if you knew why the data are missing (e.g. no record was created when the count was zero, someone lost the notepad, spilled coffee over it, etc.). It would also help if you had data from another source for the missing days: you could then do some form of multiple imputation (you may have to create your own) across the data sources, and the missing data would be much more likely to be missing at random conditional on the other data.
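To illustrate the "impute conditional on another source" idea, here is a hedged sketch using scikit-learn's `IterativeImputer`, with a fabricated correlated record from a hypothetical nearby station (all column names and data are illustrative). Note that a single `fit_transform` is single imputation; for proper multiple imputation you would run it several times with `sample_posterior=True` and pool the results.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 365

# Hypothetical data: a seasonal signal at a nearby station, and our
# station's record, which is strongly correlated with it.
nearby = 15 + 10 * np.sin(2 * np.pi * np.arange(n) / 365) + rng.normal(0, 1, n)
ours = nearby + rng.normal(0, 0.5, n)

df = pd.DataFrame({"temp_ours": ours, "temp_nearby": nearby})
df.loc[30:60, "temp_ours"] = np.nan  # a month-long gap in our record

# Each column is regressed on the others, so the gap in temp_ours is
# filled conditional on the nearby station's values for those days.
imputed = IterativeImputer(random_state=0).fit_transform(df)
df_filled = pd.DataFrame(imputed, columns=df.columns, index=df.index)
```

Unlike a forward fill, the imputed month tracks the seasonal variation recorded by the other source rather than repeating a single pre-gap value.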
