Solved – Is there an advantage to using a moving average versus removing outliers?

I have a dataset in which each hour has 3 readings (sometimes one is missing and sometimes one is clearly an outlier). I am trying to find the mean of the entire dataset for the parameter. Two approaches have been suggested to me: take the mean of each hour and then the mean of those hourly means, since this should help minimize the contribution of single outliers; or take the mean of the entire set and ignore the time.

Those two methods give very similar results and similar standard deviations. However, if I trim out the outliers, the results for some datasets differ significantly (the outliers are usually a single observation more than 4 times the others, which I think is due to an observation error), so I think removing these outliers would be a good thing. So what is the advantage of a moving average? Is it inappropriately used here, or am I misunderstanding it?

My dataset sort of looks like below:

Hour | observation
0 | 5
0 | 6
0 | 5.6
1 | .
1 | 4
1 | 4.8
2 | 5.1
2 | 5.4
2 | 498
…

Best Answer

Removing 'outliers' is a good idea IF the outliers can properly be described as erroneous readings or mistakes, or are otherwise known to be misleading. However, ascribing one of those properties to the outliers requires something more than inspection of the data. Were the three readings at each time point intended to allow assessment of whether the measurement 'worked', or are they intended simply to reduce the influence of measurement variability? If the former, then you should consider the known properties of the test and measurement system when deciding what causes outliers. Missing values and odd things like the decimal point without numerals may indicate a 'fragile' measurement or recording system that can be expected to yield outliers that are mistakes.

If a fourfold range among repeated measurements is plausibly just a result of the underlying variability of the values or of the measurement method, then do not remove the extreme values. Perhaps the median of the distribution, which is less strongly affected by outliers, will serve instead of the mean.
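As a rough illustration of that point, here is a minimal sketch that computes the two means described in the question and the median, using only the sample rows shown in the question. The names `readings` and `present`, and the use of `None` for the missing value, are my own assumptions for the example, not anything from the original dataset.

```python
from statistics import mean, median

# Sample readings from the question; None marks the missing value ("." in the table).
readings = {
    0: [5, 6, 5.6],
    1: [None, 4, 4.8],
    2: [5.1, 5.4, 498],
}

def present(values):
    """Drop missing readings (hypothetical helper for this example)."""
    return [v for v in values if v is not None]

# Pool every available reading, ignoring the hour.
all_values = [v for hour in readings.values() for v in present(hour)]

overall_mean = mean(all_values)
mean_of_hourly_means = mean(mean(present(h)) for h in readings.values())
overall_median = median(all_values)
median_of_hourly_medians = median(median(present(h)) for h in readings.values())

print(f"overall mean:             {overall_mean:.2f}")
print(f"mean of hourly means:     {mean_of_hourly_means:.2f}")
print(f"overall median:           {overall_median:.2f}")
print(f"median of hourly medians: {median_of_hourly_medians:.2f}")
```

On these few rows the single reading of 498 pulls both means far above the rest of the data (roughly 66.7 and 59.8), while the medians stay at 5.25 and 5.4. That says nothing about whether 498 is a mistake; it only shows that the median is much less sensitive to it.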

If you have lots of data then you are in a good position to work out whether the outliers are bad values, because you can plot the distribution of the readings.
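If a plotting library is not at hand, a crude text histogram is one way to look at the distribution. The sketch below uses only the sample rows from the question; the function names `bin_counts` and `text_histogram` and the bin width of 1.0 are assumptions made for illustration, and a real dataset would likely need a different bin width or a log scale.

```python
import math
from collections import Counter

# Sample readings from the question, with the missing value already dropped.
values = [5, 6, 5.6, 4, 4.8, 5.1, 5.4, 498]

def bin_counts(values, bin_width=1.0):
    """Count readings per bin; keys are the lower edge of each bin."""
    return Counter(math.floor(v / bin_width) * bin_width for v in values)

def text_histogram(values, bin_width=1.0):
    """Return one line per non-empty bin: lower edge, then one '#' per reading."""
    counts = bin_counts(values, bin_width)
    return [f"{edge:>8.1f} | {'#' * counts[edge]}" for edge in sorted(counts)]

for line in text_histogram(values):
    print(line)
```

With many readings, a main cluster plus a few isolated bins far away would be consistent with bad values, whereas a long continuous tail would suggest the large readings are part of the real variability.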
