Solved – Outlier detection/imputation – discussion

Tags: data-imputation, outliers, splines

Introduction:

I'm working with heart rate data. The IBI (inter-beat interval) is the time between two consecutive heart beats and is usually measured in milliseconds. I followed a subject for 6 days and, using a device, recorded all of his IBIs. My dataset has about 450,000 IBI measurements. The device I'm using to measure IBIs is not 100% accurate, so I may get IBI values that are out of range and not consistent with the surrounding measurements. Consider the data: 1500, 580, 590, 570, 580, 1450, 560, 590, ... For sure, 1500 and 1450 are outliers.

Question:

I'm trying to find a reasonable way to:

  1. Detect outliers (like 1500, 1450 in the example above)
  2. Impute a reasonable value for detected outliers

What I have done so far:

My idea is to fit a natural cubic spline to the data. Then, any point outside the 95% confidence interval around the natural spline (N.S.) fit is considered an outlier. I can then impute the outliers using the fitted values from the N.S. fit.
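For concreteness, here is a minimal sketch of that idea in Python. Everything about the data layout is an assumption on my part: the IBI series is taken to sit in a pandas DataFrame with columns t (beat index or time) and ibi (in ms), the spline degrees of freedom (df=20) are arbitrary, and a per-observation prediction interval is used in place of the confidence interval of the mean, which would usually be far too narrow to contain individual beats.

```python
# Sketch of the spline-plus-interval idea. Assumed (not from the post):
# DataFrame columns "t" and "ibi" (ms); df=20 and alpha=0.05 are illustrative.
import numpy as np
import statsmodels.formula.api as smf

def flag_and_impute(df, spline_df=20, alpha=0.05):
    # Fit a natural cubic regression spline (patsy's cr() basis) by OLS.
    fit = smf.ols(f"ibi ~ cr(t, df={spline_df})", data=df).fit()

    # Per-observation prediction band at level 1 - alpha.
    band = fit.get_prediction(df).summary_frame(alpha=alpha)

    out = df.copy()
    out["fitted"] = band["mean"].values
    out["outlier"] = (
        (df["ibi"].values < band["obs_ci_lower"].values)
        | (df["ibi"].values > band["obs_ci_upper"].values)
    )
    # Replace flagged beats with the fitted spline value.
    out["ibi_clean"] = np.where(out["outlier"], out["fitted"], out["ibi"])
    return out
```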

Issue:

Consider the data above: the outliers 1500 and 1450 have a substantial effect on the N.S. fit. They might even cause the true values right next to them (580, 590, ...) to be flagged as outliers. This means that even though those values are valid, they would be imputed with fitted values that are, in this case, much larger than the valid values.

How can I solve the issue above?

Is it reasonable to first fit an initial N.S. fit and mark all the outliers, then fit a secondary N.S. (natural spline) fit with the outliers excluded from the dataset, and use this secondary fit to impute values for the outliers?
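A sketch of that two-pass variant, under the same assumed DataFrame layout (t, ibi) and the same illustrative settings as the sketch above:

```python
# Two-pass sketch: flag with a first spline fit, then refit without the
# flagged beats and impute from the second fit. Same assumptions as before.
import statsmodels.formula.api as smf

def two_pass_impute(df, spline_df=20, alpha=0.05):
    # Pass 1: fit on everything, flag points outside the prediction band.
    fit1 = smf.ols(f"ibi ~ cr(t, df={spline_df})", data=df).fit()
    band1 = fit1.get_prediction(df).summary_frame(alpha=alpha)
    flagged = (
        (df["ibi"].values < band1["obs_ci_lower"].values)
        | (df["ibi"].values > band1["obs_ci_upper"].values)
    )

    # Pass 2: refit using only the unflagged beats, so the extreme values no
    # longer pull the spline (and its band) towards themselves.
    fit2 = smf.ols(f"ibi ~ cr(t, df={spline_df})", data=df.loc[~flagged]).fit()

    out = df.copy()
    out["outlier"] = flagged
    # Impute the flagged beats from the second, outlier-free fit.
    out.loc[flagged, "ibi"] = fit2.predict(df.loc[flagged])
    return out
```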

Would you have any better idea?

Thanks a lot for your help. I believe this message can initiate a very good discussion on outlier detection/imputation methods.

Best Answer

I think it would be better to base an outlier rule on something other than statistics; or, at least, on something other than statistics alone.

We know a lot about heart beats. Use that knowledge to generate a rule. You say your device is not 100% accurate - well, no device is. But what is known about the inaccuracies of the device you are using?

If values are not impossible or demonstrably erroneous, don't exclude them just because they are unusual. Investigate them. Unless the device sometimes spews out random numbers (e.g. it reports 1500 instead of 250), the numbers you get reflect reality; by throwing them out you are losing information and distorting the data.
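As one illustration of a knowledge-based rule, here is a sketch that flags beats outside a plausible physiological range and beats that jump too far from the local median. Every number in it (270–2000 ms, a 25% jump, a 5-beat window) is a placeholder that should be replaced by what is actually known about the subject and about the device's failure modes.

```python
# Sketch of a knowledge-based (not purely statistical) artifact rule.
# All thresholds below are illustrative placeholders, not recommendations.
import pandas as pd

def physiological_flags(ibi, low_ms=270, high_ms=2000,
                        max_rel_jump=0.25, window=5):
    ibi = pd.Series(ibi, dtype=float)
    # Rule 1: outside the assumed physiologically plausible range.
    out_of_range = (ibi < low_ms) | (ibi > high_ms)
    # Rule 2: too far from the median of the surrounding beats.
    local_median = ibi.rolling(window, center=True, min_periods=1).median()
    too_jumpy = (ibi - local_median).abs() > max_rel_jump * local_median
    return (out_of_range | too_jumpy).to_numpy()
```

On the example series 1500, 580, 590, 570, 580, 1450, 560, 590, this flags the 1500 and 1450 values and leaves the neighbouring beats alone.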

You can, however, use statistical methods that work with fat-tailed distributions.
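One such option, sketched below rather than prescribed: fit the same natural spline basis with a robust M-estimator instead of OLS, so extreme beats are down-weighted rather than deleted. This uses statsmodels' RLM with a Huber norm and the same assumed t/ibi columns as the earlier sketches.

```python
# Sketch of a robust spline fit: extreme beats are down-weighted, not removed.
import statsmodels.api as sm
import statsmodels.formula.api as smf

def robust_spline_fit(df, spline_df=20):
    fit = smf.rlm(f"ibi ~ cr(t, df={spline_df})", data=df,
                  M=sm.robust.norms.HuberT()).fit()
    # fit.weights shows how strongly each beat was down-weighted; values near
    # zero mark the observations the robust fit effectively ignored.
    return fit.fittedvalues, fit.weights
```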
