Solved – Outlier Detection in Time-Series: How to reduce false positives

computational-statistics · outliers · time-series

I'm trying to automate outlier detection in time-series and I used a modification of the solution proposed by Rob Hyndman here.

Say I measure daily visits to a website from various countries. For countries where the daily visits number a few hundred or a few thousand, my method seems to work reasonably well.

However, for a country that generates only 1 or 2 visits per day, the limits produced by the algorithm are very narrow (e.g. 1 ± 0.001), so a day with 2 visits is flagged as an outlier. How could I automatically detect such cases, and how should I treat them when identifying outliers? I wouldn't like to set a manual threshold of, say, 100 visits per day.
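For reference, one way to flag such degenerate series automatically is to check whether the series' level or robust spread falls below the granularity of the counts themselves. This is only a sketch; the thresholds (`min_mean`, `min_interval_width`) are illustrative assumptions, not part of Hyndman's method:

```python
import numpy as np

def is_too_sparse(counts, min_interval_width=1.0, min_mean=5.0):
    """Heuristic check for series where band-based outlier detection
    will degenerate (thresholds are illustrative, not recommendations).

    A series is flagged as sparse when its average level is tiny, or
    when a robust spread estimate (the interquartile range) is narrower
    than one whole count -- the situation where limits like 1 +/- 0.001
    mark every change as an outlier.
    """
    counts = np.asarray(counts, dtype=float)
    q25, q75 = np.percentile(counts, [25, 75])
    return counts.mean() < min_mean or (q75 - q25) < min_interval_width
```

Series flagged by a check like this could then be excluded from the band-based detector or routed to a count-model-based rule instead.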

Thank you!

Best Answer

Don't expect much for small, discrete counts. Going from 1 to 2 visits is a 100% increase, and going from 0 to 1 visits is an infinite increase. At low levels you may be dealing with zero-inflated counts, and the data can be very noisy down there as well.

In my experience, count data with a mixture of large and small counts like this presents two problems with the small counts: 1) they are too coarse to do much with, and 2) they are generated by different processes. (Think of a small rural post office versus a big-city post office.) So you need to at least split your modeling in two: do what you're successfully doing for the larger counts, and do something different -- coarser and more approximate -- with the small counts. But don't expect much of the small counts.

The good news is that the big counts, by definition, include more of your traffic, so your better model covers more of the data, even though it may not cover most of your countries.

(I say "modeling" to be general, but of course outlier detection is assuming a particular model and finding points that are highly unlikely with that model's assumptions.)
