Solved – interrupted time series with non-constant variance

Tags: r, time-series

I am new to time series analysis and have been learning the basics through a couple of resources. I am working with monthly data with a known intervention date (mid-2014), and I am trying to determine whether there was a significant change in the mean number of counts before versus after the intervention.

The main issue is that my data do not appear to have constant variance over time (just from eyeballing the plot), and I do not know what steps should be taken for appropriate model selection. Most of the material I've read so far explains the process only for stationary time series, not for cases where the stationarity assumptions are violated. A plot of my data is below.

[Plot of the monthly counts over time.]

I am working in R and any help or thoughts will be much appreciated!

Edit: I've added the data below. month.counts is the counts (the outcome of interest) per month, pre.post.int is a binary variable indicating the intervention (0 = pre-intervention, 1 = post-intervention), ts.counts is a ts object in R, and df.counts is a data frame containing a date index, the counts, and the intervention indicator as columns.

# Monthly counts (outcome of interest); stored as numeric, not character
month.counts <- c(24, 22, 14, 12, 30, 4, 18,
                  37, 5, 12, 18, 37, 24, 11,
                  29, 31, 21, 29, 17, 22, 19,
                  19, 9, 6, 18, 10, 33, 25,
                  18, 10, 7, 11, 10, 26, 44,
                  34, 15, 20, 30, 14, 18, 31,
                  15, 19, 25, 19, 20, 12, 11,
                  23, 19, 15, 28, 28, 22, 44,
                  41, 38, 49, 66, 83, 66, 64,
                  69, 82, 42, 54, 65, 86, 83,
                  108, 68, 65, 60, 55, 45, 49,
                  42, 62, 61, 67, 54, 53, 44)

# Intervention indicator: 0 = pre-intervention (periods 1-51), 1 = post-intervention (periods 52-84)
pre.post.int <- c(rep(0, 51), rep(1, 33))

ts.counts <- ts(data = month.counts, frequency = 12, start = c(2010, 1))

df.counts <- data.frame(date.index = seq_along(month.counts),
                        counts     = month.counts,
                        pre.post   = pre.post.int)
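For concreteness, one simple way the pre/post question might be framed in R is a segmented count regression. This is only a sketch under assumed choices (a quasi-Poisson family so the variance can scale with the mean), not a settled model for these data:

# Sketch: quasi-Poisson regression of the counts on a linear trend and the
# intervention indicator; the quasi-Poisson family lets the variance grow
# with the mean rather than assuming it is constant.
fit.glm <- glm(counts ~ date.index + pre.post,
               family = quasipoisson(link = "log"),
               data   = df.counts)
summary(fit.glm)        # the pre.post coefficient tests for a change in level
acf(residuals(fit.glm)) # check for leftover autocorrelation before trusting the test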

Best Answer

Visually choosing appropriate (i.e. minimally sufficient) statistical/modelling remedies is fraught with possible error, as the eye can be a poor man's statistical expert system. For example, if you simply adjust two of the most recent values, the constant-error-variance hypothesis becomes "more believable". Violations of the Gaussian assumptions can take the form of:

1. the expected value being biased by pulses, level shifts, seasonal pulses, or time trends;
2. the parameters changing over time, requiring segmentation or TAR (threshold autoregressive) models;
3. the error variance not being homoscedastic (constant), requiring either a power transform or generalized least squares with weights.

Good software/approaches essentially try alternative gambits to render the final error process Gaussian, or at least not statistically different from Gaussian, while adhering to the principle "first do no harm".
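As an illustration of the third point, here is a sketch of a quick power-transform check using the forecast package (my own choice of tool, not something prescribed in this answer):

library(forecast)

# Guerrero's method picks the Box-Cox lambda that stabilises the variance:
# a lambda near 1 argues against transforming, a lambda near 0 suggests logs.
BoxCox.lambda(ts.counts, method = "guerrero")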

There are solutions available in R to sort out the conundrum. One of them is AUTOBOX, which conducts various tests to detect and suggest appropriate remedies for problems like the one you are describing. I have been involved in leading the research into practical solutions in this area.
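AUTOBOX is commercial software; as a rough open-source analogue of the pulse/level-shift screening idea (the package choice is my assumption, not a claim about AUTOBOX's internals), the tsoutliers package automates intervention detection:

library(tsoutliers)

# tso() fits an ARIMA model and searches for additive outliers (AO, i.e. pulses),
# level shifts (LS), and transient changes (TC), reporting which periods are flagged.
tso(ts.counts, types = c("AO", "LS", "TC"))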

If you post your data, I will try to be of more help (by example).

EDITED AFTER RECEIPT OF DATA:

As I surmised, three unusual values led your "eye" to conclude that unwarranted complications were needed. AUTOBOX identified/suggested the following fairly simple model as being adequate. [Model output shown in three images.]

The Actual and Cleansed series [image] reflect the pulse adjustments for the three points.

The plot of the model's residuals suggests sufficiency [image], confirmed by the ACF of the residuals [image].

The Actual/Fit and Forecast plot is here: [image].

The OP suggested a LEVEL SHIFT starting at period 52. I introduced that variable as a possible predictor and obtained the following model [model output shown in two images], with slightly different results but similar forecasts and a "different set" of unusual values. Note that the RMSE of the two models is quite similar.
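In open-source terms (my own sketch of the same idea, not the AUTOBOX model itself), forcing the period-52 level shift in as a regressor and letting the ARIMA error structure be chosen automatically would look like:

library(forecast)

# 0 before period 52, 1 from period 52 onward (equivalent to the OP's pre.post.int)
level.shift <- as.numeric(seq_along(ts.counts) >= 52)

fit.ls <- auto.arima(ts.counts, xreg = cbind(ls52 = level.shift))
summary(fit.ls)        # the ls52 coefficient estimates the size of the level shift
checkresiduals(fit.ls) # residual plot plus Ljung-Box test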

Notice, however, that this model's residual plot has "less clumpiness", possibly suggesting a better representation.

[Residual plot for the level-shift model.]
