Solved – Missing data imputation in time series in R

data preprocessing, interpolation, missing data, r, time series

I have hourly temperature data from 2012 to 2016, as follows:

> head(htemp)
     HH_ID      TEMPERATURE   YY_ID    DD_ID       MM_ID
1 201201010000     8.98       2012    20120101     201201
2 201201010100     8.67       2012    20120101     201201
3 201201010200     8.69       2012    20120101     201201
4 201201010300     8.50       2012    20120101     201201
5 201201010400     8.30       2012    20120101     201201
6 201201010500     8.10       2012    20120101     201201

The data contain three kinds of gaps:

  1. Missing values (NA) for individual hours (e.g. 201209281400, 201209290000...).
  2. Missing values for consecutive hours (e.g. NA at 201210241700, 201210241800, 201210241900).
  3. No observations (NA) for a whole day (20130328).

I am wondering how to interpolate the missing data from adjacent data: linear interpolation for individual missing hours, and, for a whole missing day, the "typical" pattern from adjacent days (linearly interpolating each hour of the missing day from the temperature at the same hour on adjacent days).

Edit

I would also like to perform time series analysis on the temperature data, such as decomposing it (stl), modelling it (auto.arima) and forecasting it (forecast). It seems stl cannot handle missing data, so I think the missing data need to be imputed first.

I have another data set containing electricity demand, which has no missing data. I may also model the demand data using the temperature data as a covariate.

Update

I tried imp<-mice(htemp) on my data, but got an error:

iter imp variable
  1   1  TEMPERATURE
Error in solve.default(xtx + diag(pen)) : 
  system is computationally singular: reciprocal condition number = 5.03072e-28

Best Answer

First, many imputation packages do not work when whole rows are missing. Their algorithms rely on correlations between the variables, so if a row contains no other observed variable, there is nothing to estimate the missing value from.

You need an imputation package that uses the time structure of the series.

You could, for example, use the imputeTS package to impute the temperature.

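library(imputeTS)
# Hourly data with a daily cycle: 24 observations per seasonal period
x <- ts(htemp$TEMPERATURE, frequency = 24)
# Fill the NAs by Kalman smoothing on a state-space model
x.withoutNA <- na_kalman(x)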

This is one possible way of getting imputed temperature values.
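To get a rough idea of how well an imputation method suits this kind of series, one option is to hide values that are actually known, impute them, and compare. The sketch below is only an illustration: `mask_and_score`, `linear_fill` and the synthetic series are hypothetical names and data introduced here, not part of any package or of the htemp data, and base R `approx` stands in as a simple baseline method.

```r
# Hypothetical helper: hide some known values, impute them, and return the RMSE.
# `impute_fn` is any function that takes a series with NAs and returns it filled.
mask_and_score <- function(x, impute_fn, n_mask = 50, seed = 1) {
  set.seed(seed)
  observed <- which(!is.na(x))
  # Never hide the first or last observed point, so every hidden value
  # still lies between two observed ones.
  candidates <- observed[-c(1, length(observed))]
  stopifnot(n_mask >= 1, n_mask < length(candidates))
  hidden <- sample(candidates, n_mask)

  x_masked <- x
  x_masked[hidden] <- NA
  x_filled <- impute_fn(x_masked)

  sqrt(mean((as.numeric(x_filled)[hidden] - as.numeric(x)[hidden])^2))
}

# Synthetic hourly series with a daily cycle (illustrative data only).
hours <- 0:(24 * 30 - 1)
synthetic <- ts(10 + 5 * sin(2 * pi * hours / 24), frequency = 24)

# Baseline imputation: linear interpolation in time with base R.
linear_fill <- function(x) {
  approx(seq_along(x), as.numeric(x), xout = seq_along(x))$y
}

rmse_linear <- mask_and_score(synthetic, linear_fill)
print(rmse_linear)
```

A package function such as imputeTS::na_kalman could be passed as `impute_fn` in the same way, which makes it possible to compare methods on the same hidden values. Randomly hidden single hours say little about whole-day gaps, so a fairer check for those would hide a full day instead.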

Here is another one, using the forecast package:

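library(forecast)
# Hourly data with a daily cycle: 24 observations per seasonal period
x <- ts(htemp$TEMPERATURE, frequency = 24)
# Fill the NAs by interpolation (seasonal series are handled via a decomposition)
x.withoutNA <- na.interp(x)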

These packages work here because they rely on correlations over time within a single variable rather than on correlations between variables.
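If you would rather follow the scheme described in the question by hand, it can be approximated in base R. This is a minimal sketch under stated assumptions, not a tested recipe: `fill_hourly` and its `max_gap` threshold are hypothetical, the series is assumed to start at hour 00, to cover whole days on a regular hourly grid (missing hours present as NA), and to contain at least two observed values. Short gaps are interpolated linearly in time; longer gaps are interpolated hour by hour across neighbouring days.

```r
# Hypothetical helper: fill NAs in an hourly temperature vector in two passes.
# Assumes the vector starts at hour 00 and its length is a multiple of 24.
fill_hourly <- function(temp, max_gap = 6) {
  stopifnot(length(temp) %% 24 == 0)
  idx <- seq_along(temp)

  # Pass 1: linear interpolation in time, kept only for runs of NAs
  # no longer than max_gap hours (an arbitrary threshold for this sketch).
  in_time <- approx(idx, temp, xout = idx, rule = 2)$y
  runs <- rle(is.na(temp))
  short <- rep(runs$values & runs$lengths <= max_gap, runs$lengths)
  out <- temp
  out[short] <- in_time[short]

  # Pass 2: for the remaining (longer) gaps, interpolate each hour of day
  # across days. Rows are hours of the day, columns are days.
  by_day <- matrix(out, nrow = 24)
  days <- seq_len(ncol(by_day))
  for (h in 1:24) {
    vals <- by_day[h, ]
    if (anyNA(vals) && sum(!is.na(vals)) >= 2) {
      by_day[h, ] <- approx(days, vals, xout = days, rule = 2)$y
    }
  }
  as.numeric(by_day)  # back to the original hour-by-hour order
}

# Illustrative data: five days with a daily cycle and a slow warming trend.
truth <- 10 + 5 * sin(2 * pi * (0:(24 * 5 - 1)) / 24) + rep(0:4, each = 24)
with_gaps <- truth
with_gaps[10] <- NA        # single missing hour
with_gaps[100:102] <- NA   # three consecutive missing hours
with_gaps[49:72] <- NA     # whole third day missing

filled <- fill_hourly(with_gaps)
```

With `rule = 2`, gaps at the very start or end of the series are filled with the nearest observed value rather than extrapolated, which may or may not be what you want.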
