Solved – How to find patterns and identify changes in them in time series with R

r, time series

This is my first question on stats, just trying to learn the basics of time series analysis with R. So any good suggestions about learning resources will be highly appreciated as well as the answer to the question.

For the data below, let's say it represents number of website visits per day, I would like to find out:

  1. What the weekly pattern is (e.g. the highest number of visits
    occurs on Thursdays, the lowest on Fridays etc.)

  2. Automatically detect changes in that pattern (e.g. in 2008 most of the visits
    occur on Thursdays, but on 2009-01-04 the pattern changes to
    something else)

Please let me know if I can provide more details.

> str(daily)
An ‘xts’ object on 2007-02-19 23:32:16/2013-05-05 15:09:17 containing:
  Data: num [1:2268, 1] 55 32 70 48 75 50 48 46 36 55 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr "cnt"
  Indexed by objects of class: [POSIXct,POSIXt] TZ: 
  xts Attributes:  
 NULL

> head(daily)
                    cnt
2007-02-19 23:32:16  55
2007-02-20 23:58:58  32
2007-02-21 23:40:41  70
2007-02-22 23:01:41  48
2007-02-23 23:53:06  75
2007-02-24 23:47:07  50

plot(daily)

[Figure: plot of the daily series]

Full dataset: https://dl.dropboxusercontent.com/u/65347419/daily.csv

Best Answer

You will need to consider 6 day-of-week dummies, 11 monthly dummies, and roughly 10-15 holiday dummy variables. Do NOT consider any ARIMA structure, as you want to rely on the deterministic variables listed here. You will also need to consider trend (a counter variable such as 1,2,3,4,5,6, etc., and perhaps changes in trend, so there could be more than one, e.g. 0,0,0,0,1,2,3,4,5, etc.), outliers, level shifts, changes in seasonality (i.e. seasonal pulses; as you very smartly point out, there are changes in the day-of-the-week pattern!), and lead and lag impacts around the holidays. There might also be day-of-the-month variables, but we see that more with datasets tied to cash, as payday is usually around the middle and the end of the month.
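
As a rough illustration of the deterministic regressors described above, here is a minimal base-R sketch. The data are synthetic and the effect sizes are made up for the example (both are assumptions, not values from the question's dataset); holidays, outliers and level shifts are left out to keep it short.

```r
# Synthetic daily series (an assumption for illustration; substitute your own data).
set.seed(1)
dates <- seq(as.Date("2010-03-01"), by = "day", length.out = 728)  # starts on a Monday

# Day of week and month as factors; $wday is 0 = Sunday, ..., 6 = Saturday.
dow <- factor(as.POSIXlt(dates)$wday, levels = 0:6,
              labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
mon   <- factor(as.POSIXlt(dates)$mon + 1, levels = 1:12)
trend <- seq_along(dates)                                           # 1, 2, 3, ...

# Made-up weekday effects relative to Sunday, used only to generate the fake data.
true_dow <- c(Sun = 0, Mon = 40, Tue = 60, Wed = 55, Thu = 50, Fri = 20, Sat = -10)
cnt <- 100 + unname(true_dow[as.character(dow)]) + 0.05 * trend +
  rnorm(length(dates), sd = 5)

# lm() expands each factor into k - 1 dummies: 6 for weekday, 11 for month.
fit <- lm(cnt ~ dow + mon + trend)

# Estimated weekday effects relative to the baseline day (Sunday here).
round(coef(fit)[grep("^dow", names(coef(fit)))], 1)
```

The baseline day and month are whichever factor levels come first, so their effects are absorbed into the intercept; every other coefficient is read relative to that baseline.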

You would need to remove the dummy variables that are insignificant. When you are done, you can do a poor man's check: compare each coefficient in the model with that category's percentage of the total to see if they make sense. For example, if Monday contributes 50% of the overall volume, then your Monday dummy should be POSITIVE and much larger than the other coefficients.
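
The poor man's check can be sketched in a few lines of base R. The series below is synthetic with made-up weekday effects (an assumption for illustration); the point is only that the ranking of weekday shares of the total should agree with the ranking of the weekday coefficients.

```r
# Synthetic data: 52 full weeks starting on a Monday (values are made up).
set.seed(2)
dates <- seq(as.Date("2010-03-01"), by = "day", length.out = 364)
dow <- factor(as.POSIXlt(dates)$wday, levels = 0:6,
              labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
effect <- c(Sun = 0, Mon = 40, Tue = 60, Wed = 55, Thu = 50, Fri = 20, Sat = -10)
cnt <- 100 + unname(effect[as.character(dow)]) + rnorm(length(dates), sd = 5)

# Each weekday's share of the total volume.
share <- as.vector(tapply(cnt, dow, sum)) / sum(cnt)

# Weekday coefficients from a dummy regression; Sunday is the baseline (effect 0).
fit   <- lm(cnt ~ dow)
coefs <- c(0, unname(coef(fit)[-1]))

check <- data.frame(share_pct = round(100 * share, 1),
                    coef      = round(coefs, 1),
                    row.names = levels(dow))
check[order(-check$share_pct), ]   # directional agreement is what matters
```

With a real model that also contains trend, holidays and outliers, the agreement will be directional rather than exact.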

Feel free to post your data and I would be glad to look at it. Just make sure to state the date of the first observation and the country the data is from, so that the appropriate holidays can be brought in.

We have been working on time series (since 1975) and on the issue of daily data (since 1998).

Yes, I appreciate your goal of learning how to do this, but the best I can do is this. Maybe you can reverse engineer it?

Sorry for the delay!

OK, we have analyzed your data and here are our findings. We reduced the data set to the last 1,162 observations, so the data begins on Monday 3/1/2010. The Monday start date is very important when interpreting the day-of-the-week variables. While 6 years of data can be helpful, in this case it is too much, as the values are so small at the beginning.

Here is a summary of the average and holidays:

The average demand is 211.

Let's review the holidays. There is a decrease in demand of 48.30 starting 4 days before Christmas. Thanksgiving has an impact on the day of and the day after. Most holidays have a negative impact, except St. Patrick's Day.

Y(T) =  211.59
   +[X1(T)][(- 48.2988B**-4 - 59.9340B**-3 - 100.12B**-2 - 238.07B**-1 - 150.08 - 352.52B** 1)]   M_CHRISTMAS
   +[X2(T)][(- 64.4805)]                                 M_CINCODEMAYO
   +[X3(T)][(- 15.8391)]                                 M_COLUMBUS
   +[X4(T)][(- 31.9900B**-3 - 100.62 - 11.5661B** 1 - 27.8408B** 2 - 14.4484B** 4)]   M_GOODFRIDAY
   +[X5(T)][(- 14.9771B** 1)]                            M_FATHERSDAY
   +[X6(T)][(- 36.0068B** 1)]                            M_HALLOWEEN
   +[X7(T)][(- 195.05 - 103.42B** 1)]                    M_JULY4TH
   +[X8(T)][(- 198.80 - 53.2202B** 1)]                   M_LABORDAY
   +[X9(T)][(- 28.3956 - 29.5278B** 1)]                  M_MARDIGRAS
   +[X10(T)[(- 81.8183)]                                 M_MARTINLKING
   +[X11(T)[(- 209.58 - 19.8445B** 1)]                   M_MEMORIALDAY
   +[X12(T)[(- 166.62B**-4 - 82.5935B**-3 - 73.4411B**-2 - 218.53B**-1 - 117.53 - 115.39B** 1)]   M_NEWYEARS
   +[X13(T)[(- 113.83 - 12.5742B** 1)]                   M_PRESIDENTS
   +[X14(T)[(+ 13.2287B** 1)]                            M_STPATRICKS
   +[X15(T)[(- 37.4732B** 1)]                            M_STVALENTINES
   +[X16(T)[(- 244.69 - 206.13B** 1)]                    M_THANKSGIVI
   +[X17(T)[(- 42.7715 + 17.9379B** 4 + 13.8972B** 5)]   M_VETERANSDAY
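
To make the lead and lag terms concrete, here is a small base-R sketch of how such holiday regressors could be built. It assumes that B is the backshift operator, so that B**1 is the day after the holiday and B**-4 is four days before it; this reading matches the answer's statement that the Christmas effect starts 4 days before the holiday. The helper `shift()` and the column names are hypothetical, and the coefficients are the M_CHRISTMAS values from the model output.

```r
# A short window of dates around Christmas 2012.
dates   <- seq(as.Date("2012-12-15"), as.Date("2013-01-05"), by = "day")
holiday <- as.numeric(dates == as.Date("2012-12-25"))   # 1 on the holiday, else 0

# shift(x, k): move x k days later (k > 0) or k days earlier (k < 0), padding with 0.
shift <- function(x, k) {
  n <- length(x)
  if (k > 0) {
    c(rep(0, k), x[seq_len(n - k)])
  } else if (k < 0) {
    c(x[seq(1 - k, n)], rep(0, -k))
  } else {
    x
  }
}

# One dummy per offset: 4 days before the holiday through 1 day after.
lags <- -4:1
X <- sapply(lags, function(k) shift(holiday, k))
colnames(X) <- paste0("xmas_", ifelse(lags < 0, "lead", "lag"), abs(lags))
rownames(X) <- format(dates)

# Christmas coefficients from the model output, in the same order as `lags`.
xmas_coef <- c(-48.2988, -59.9340, -100.12, -238.07, -150.08, -352.52)
effect <- drop(X %*% xmas_coef)
effect[effect != 0]   # estimated holiday effect on each affected day
```

In a regression these six columns would simply be added to the design matrix next to the weekday, month and trend terms.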

Autobox searches for impacts when holidays land on a Monday or a Friday. The Monday after a holiday on a Friday is lower by 57.23. When there is a holiday on a Friday or Monday, the weekend has lower demand by 3.46.

   +[X18(T)[(- 57.2342)]                                 MONDAY_AFTER
   +[X19(T)[(- 3.4639)]                                  LONGWEEKEND

The month-of-the-year pattern has February and March as the highest months and August as the lowest. February is not significant, so it is at the same level as March, which is the intercept.

   +[X20(T)[(- 19.7274)]                                 MONTH_EFF04
   +[X21(T)[(- 47.0142)]                                 MONTH_EFF05
   +[X22(T)[(- 78.4654)]                                 MONTH_EFF06
   +[X23(T)[(- 88.7855)]                                 MONTH_EFF07
   +[X24(T)[(- 91.1418)]                                 MONTH_EFF08
   +[X25(T)[(- 84.4558)]                                 MONTH_EFF09
   +[X26(T)[(- 75.2718)]                                 MONTH_EFF10
   +[X27(T)[(- 65.9504)]                                 MONTH_EFF11
   +[X28(T)[(- 47.3812)]                                 MONTH_EFF12
   +[X29(T)[(- 14.9862)]                                 MONTH_EFF01

Saturdays are the lowest, Sundays (not shown, as Sunday is the intercept, or average, of 211.59) are at the average, and Tuesdays and Wednesdays are the highest. Remember that Monday was the first day in the dataset, so the first variable reflects Monday.

   +[X30(T)[(+  189.10    )]                             FIXED_EFF_N10107
   +[X31(T)[(+  232.88    )]                             FIXED_EFF_N10207
   +[X32(T)[(+  231.11    )]                             FIXED_EFF_N10307
   +[X33(T)[(+  219.69    )]                             FIXED_EFF_N10407
   +[X34(T)[(+  154.80    )]                             FIXED_EFF_N10507
   +[X35(T)[(- 30.4825)]                                 FIXED_EFF_N10607
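
The weekday ranking can be read straight off these six coefficients. The sketch below assumes that FIXED_EFF_N10107 through FIXED_EFF_N10607 map to Monday through Saturday (the answer states that the first variable is Monday) and that Sunday is the baseline with an effect of 0.

```r
# Day-of-week coefficients from the model output; Sunday is the baseline.
dow_coef <- c(Mon = 189.10, Tue = 232.88, Wed = 231.11, Thu = 219.69,
              Fri = 154.80, Sat = -30.4825, Sun = 0)

sort(dow_coef, decreasing = TRUE)   # weekdays from highest to lowest effect
names(which.max(dow_coef))          # "Tue"
names(which.min(dow_coef))          # "Sat"
```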

There are two time trends. The first begins at time period 1 and indicates an increase in volume of .752 per day. The second is negative at -.630 and starts at period 583, but in general the trend is still up (i.e. .752 - .630 = +.122).

   +[X36(T)[(+  .752)]                                   :TIME TREND        1                                  1/  1   3/ 1/2010   I~T00001__030110
   +[X37(T)[(-  .630)]                                   :TIME TREND      583                                 84/  2  10/ 4/2011   I~T00583__030110
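
The two trend regressors and their net slope can be reproduced with a few lines of base R. This is a sketch that assumes the second trend counts 1, 2, 3, ... from period 583 onward, following the 0,0,0,0,1,2,3 pattern described earlier in the answer.

```r
n  <- 1162                      # number of observations used in the model
t1 <- seq_len(n)                # first trend: 1, 2, 3, ... from period 1
t2 <- pmax(0, t1 - 582)         # second trend: 0 through period 582, then 1, 2, 3, ...

trend <- 0.752 * t1 - 0.630 * t2

# Daily change in the combined trend before and after the break.
diff(trend)[581]                # slope up to period 582: 0.752
diff(trend)[582]                # slope from period 583:  0.752 - 0.630 = 0.122

# Period 583 counted from Monday 3/1/2010 is the date shown in the output.
as.Date("2010-03-01") + 582     # "2011-10-04"
```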

There are 22 one-time (pulse) outliers, 4 level shifts (changes in the intercept) and 9 seasonal pulses reflecting changes in the day-of-the-week pattern.
It looks like days 6 and 7 (Saturday and Sunday) have evolved to be lower a couple of times. A drop was found on Saturdays beginning 1/15/2011, 1/29/2011, 3/5/2011 and 2/11/2012. Sundays had some similar drops. Day 2 (Tuesday) also had an increase of +27.8687 beginning 5/15/2012.

Four level shifts occurred: a decrease of 4.44 beginning 10/25/2010, a decrease of 64.65 beginning 4/13/2011, a decrease of 32.45 beginning 8/13/2011 and an increase of 35.43 beginning 3/19/2012.

   +[X38(T)[(- 43.0325)]                                 :SEASONAL PULSE  713                                102/  6   2/11/2012   I~S00713__030110tet
   +[X39(T)[(- 42.1592)]                                 :SEASONAL PULSE  679                                 97/  7   1/ 8/2012   I~S00679__030110tet
   +[X40(T)[(-  354.57    )]                             :PULSE          1031                                148/  2  12/25/2012   I~P01031__030110tet
   +[X41(T)[(-  348.37    )]                             :PULSE          1038                                149/  2   1/ 1/2013   I~P01038__030110tet
   +[X42(T)[(-  231.82    )]                             :PULSE          1033                                148/  4  12/27/2012   I~P01033__030110tet
   +[X43(T)[(+  241.57    )]                             :PULSE           301                                 43/  7  12/26/2010   I~P00301__030110tet
   +[X44(T)[(+ 85.0799)]                                 :PULSE          1156                                166/  1   4/29/2013   I~P01156__030110tet
   +[X45(T)[(+  240.85    )]                             :PULSE           689                                 99/  3   1/18/2012   I~P00689__030110tet
   +[X46(T)[(+ 44.4059)]                                 :PULSE          1159                                166/  4   5/ 2/2013   I~P01159__030110tet
   +[X47(T)[(- 50.5678)]                                 :SEASONAL PULSE  329                                 47/  7   1/23/2011   I~S00329__030110tet
   +[X48(T)[(- 32.8224)]                                 :SEASONAL PULSE   28                                  4/  7   3/28/2010   I~S00028__030110tet
   +[X49(T)[(- 26.9859)]                                 :SEASONAL PULSE  370                                 53/  6   3/ 5/2011   I~S00370__030110tet
   +[X50(T)[(+ 27.8687)]                                 :SEASONAL PULSE  807                                116/  2   5/15/2012   I~S00807__030110tet
   +[X51(T)[(-  177.84    )]                             :PULSE           667                                 96/  2  12/27/2011   I~P00667__030110tet
   +[X52(T)[(+ 35.4320)]                                 :LEVEL SHIFT     750                                108/  1   3/19/2012   I~L00750__030110tet
   +[X53(T)[(- 64.6500)]                                 :LEVEL SHIFT     409                                 59/  3   4/13/2011   I~L00409__030110tet
   +[X54(T)[(- 4.4417)]                                  :LEVEL SHIFT     239                                 35/  1  10/25/2010   I~L00239__030110tet
   +[X55(T)[(+  173.26    )]                             :PULSE           585                                 84/  4  10/ 6/2011   I~P00585__030110tet
   +[X56(T)[(+  179.15    )]                             :PULSE           690                                 99/  4   1/19/2012   I~P00690__030110tet
   +[X57(T)[(-  107.75    )]                             :PULSE            24                                  4/  3   3/24/2010   I~P00024__030110tet
   +[X58(T)[(+  203.26    )]                             :PULSE           126                                 18/  7   7/ 4/2010   I~P00126__030110tet
   +[X59(T)[(+  208.55    )]                             :PULSE           664                                 95/  6  12/24/2011   I~P00664__030110tet
   +[X60(T)[(- 27.9535)]                                 :SEASONAL PULSE  335                                 48/  6   1/29/2011   I~S00335__030110tet
   +[X61(T)[(- 74.6895)]                                 :SEASONAL PULSE  322                                 46/  7   1/16/2011   I~S00322__030110tet
   +[X62(T)[(-  106.79    )]                             :PULSE          1030                                148/  1  12/24/2012   I~P01030__030110tet
   +[X63(T)[(- 33.8918)]                                 :PULSE          1158                                166/  3   5/ 1/2013   I~P01158__030110tet
   +[X64(T)[(- 32.0515)]                                 :PULSE          1161                                166/  6   5/ 4/2013   I~P01161__030110tet
   +[X65(T)[(- 32.4514)]                                 :LEVEL SHIFT     531                                 76/  6   8/13/2011   I~L00531__030110tet
   +[X66(T)[(+  137.78    )]                             :PULSE          1095                                157/  3   2/27/2013   I~P01095__030110tet
   +[X67(T)[(+  168.20    )]                             :PULSE          1128                                162/  1   4/ 1/2013   I~P01128__030110tet
   +[X68(T)[(-  127.34    )]                             :PULSE           633                                 91/  3  11/23/2011   I~P00633__030110tet
   +[X69(T)[(- 57.6397)]                                 :SEASONAL PULSE  321                                 46/  6   1/15/2011   I~S00321__030110tet
   +[X70(T)[(+  208.15    )]                             :PULSE           671                                 96/  6  12/31/2011   I~P00671__030110tet
   +[X71(T)[(-  124.65    )]                             :PULSE          1107                                159/  1   3/11/2013   I~P01107__030110tet
   +[X72(T)[(-  102.15    )]                             :PULSE           429                                 62/  2   5/ 3/2011   I~P00429__030110tet
   +[A(T)]
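
The three intervention types in this listing can be written as simple 0/1 regressors. The sketch below is one reading of them, not Autobox's internal code: a pulse is 1 at a single period, a level shift is 1 from a period onward, and a seasonal pulse is 1 every 7th day from a period onward. The function names are hypothetical; the period numbers come from the listing.

```r
n     <- 1162
dates <- as.Date("2010-03-01") + 0:(n - 1)   # period 1 is Monday 3/1/2010

# 1 at a single period, 0 elsewhere.
pulse <- function(at) as.numeric(seq_len(n) == at)

# 0 before the break, 1 from the break onward (a permanent change in level).
level_shift <- function(at) as.numeric(seq_len(n) >= at)

# 1 on every `period`-th day from the break onward (a change for one weekday).
seasonal_pulse <- function(at, period = 7) {
  t <- seq_len(n)
  as.numeric(t >= at & (t - at) %% period == 0)
}

p  <- pulse(1031)           # X40: pulse at period 1031
ls <- level_shift(750)      # X52: level shift at period 750
sp <- seasonal_pulse(807)   # X50: seasonal pulse at period 807

dates[c(1031, 750, 807)]    # the dates these periods correspond to
unique(as.POSIXlt(dates[sp == 1])$wday)   # 2, i.e. every flagged day is a Tuesday
```

Multiplying each regressor by its coefficient (for example -354.57 for X40) gives that intervention's contribution to the fitted value on each day.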

If you do simple math and look at the contribution of each month or each day to the total, you will see that the coefficients are similar. See the XLS file for this check. The match shouldn't be exact, but rather directional in nature, and it is.

We did not allow Autobox to search for ARIMA or for special days of the month, as this is not cash demand related to paydays, and we raised the maximum number of outliers to be searched for to 100 due to the large sample size.

You can see the output from the Autobox run, and the XLS file with the "poor man's" model to compare against the Autobox coefficients, in Dropbox here: https://www.dropbox.com/sh/fyd0lvbnjrlbwoz/M0sH1FFhTu
