Solved – Outlier Detection in Time-Series: How to reduce false positives

computational-statistics · outliers · time-series

I'm trying to automate outlier detection in time-series and I used a modification of the solution proposed by Rob Hyndman here.

Say I measure daily visits to a website from various countries. For countries where the daily visits number a few hundred or a few thousand, my method seems to work reasonably well.

However, for a country that generates only 1 or 2 visits per day, the limits produced by the algorithm are very narrow (e.g. 1 ± 0.001), so a day with 2 visits is flagged as an outlier. How could I automatically detect such cases, and how should I treat them when identifying outliers? I wouldn't like to set a manual threshold of, say, 100 visits per day.
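For reference, one way to flag such degenerate series automatically is to check whether the series' level or robust spread falls below the granularity of the counts themselves. This is only a sketch; the thresholds (`min_mean`, `min_interval_width`) are illustrative assumptions, not part of Hyndman's method:

```python
import numpy as np

def is_too_sparse(counts, min_interval_width=1.0, min_mean=5.0):
    """Heuristic check for series where band-based outlier detection
    will degenerate (thresholds are illustrative, not recommendations).

    A series is flagged as sparse when its average level is tiny, or
    when a robust spread estimate (the interquartile range) is narrower
    than one whole count -- the situation where limits like 1 +/- 0.001
    mark every change as an outlier.
    """
    counts = np.asarray(counts, dtype=float)
    q25, q75 = np.percentile(counts, [25, 75])
    return counts.mean() < min_mean or (q75 - q25) < min_interval_width
```

Series flagged by a check like this could then be excluded from the band-based detector or routed to a count-model-based rule instead.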

Thank you!

Best Answer

Don't expect much for small, discrete counts. Going from 1 to 2 visits is a 100% increase, and going from 0 to 1 visits is an infinite increase. At low levels you may be dealing with zero-inflated counts, and the data can be very noisy down there as well.

In my experience, count data with a mixture of large and small counts like this presents two problems with the small counts: 1) they are too coarse to do much with, and 2) they are generated by different processes. (Think of a small rural post office versus a big-city post office.) So you need to at least split your modeling in two: do what you're successfully doing for the larger counts, and do something different -- coarser and more approximate -- with the small counts. But don't expect much of the small counts.

The good news is that the big counts, by definition, include more of your traffic, so your better model covers more of the data, even though it may not cover most of your countries.

(I say "modeling" to be general, but of course outlier detection is assuming a particular model and finding points that are highly unlikely with that model's assumptions.)
