Time-Series – Quantifying Intervention Effect in Time Series Analysis

forecasting, intervention-analysis, time-series

How can the magnitude of an intervention be quantified in a segmented time series regression?

I am attempting to replicate the methodology of "Decline in pneumonia admissions after routine childhood immunisation with pneumococcal conjugate vaccine in the USA: a time-series analysis". There are several other published papers with similar methodology whose methods sections are not informative. All of these papers cite Wagner et al.; however, that paper describes a much simpler linear model.

I have a 120-month time series of count data that exhibits strong seasonal variation, has a negative baseline trend, and includes a known public health intervention at $t = 72$ that would be expected to decrease the counts (i.e., steepen the negative trend). I have fitted this with a negative binomial model that includes two lagged terms of the outcome (an AR(2)-type structure), dummy variables for months, a pre-intervention trend, an intervention indicator, and a post-intervention trend. I did this in R with the function glm.nb from the MASS package.

#In R version 3.3.1 with packages dplyr, MASS
#generate a comparable time-series ts()
set.seed(1) #added so the example is reproducible
base <- rnbinom(n = 120, size = 1400, prob = 0.5)
season <- rep(c(600, 400, 150, 0, -50, -80, -300, -600, 50, 100, 200, 300), 10)
ts <- ts(base + season, start = c(2000, 1), end = c(2009, 12), frequency = 12)
#generate the independent variables
month.f <- factor(rep(1:12, 10))
dummy.months <- model.matrix(~ month.f + 0) #one dummy per month
#lagged outcomes; coerce to numeric so dplyr's lag (which shifts values
#and pads with NA) is applied to a plain vector rather than a ts object
require(dplyr)
lag1 <- lag(as.numeric(ts))
lag2 <- lag(as.numeric(ts), 2)
#segmented-regression covariates: pre-trend, step change, post-trend
time.interv <- 72
pre.interv.trend <- c(1:time.interv, rep(0, 48))
interv.indicator <- c(rep(0, time.interv), rep(1, 48))
post.interv.trend <- c(rep(0, time.interv), 1:48)
df <- cbind.data.frame(ts, dummy.months, lag1, lag2,
                       interv.indicator, pre.interv.trend, post.interv.trend)
#the fitted model (the first two rows are dropped because the lags are NA)
require(MASS)
fit <- glm.nb(ts ~ . + 0, data = df)

I have attempted several approaches:

  1. I have tried using the forecast package to forecast the pre-intervention series and then subtracting the observed values from the predicted ones. However, the 95% prediction intervals became so wide that there was no realistic way for the observed series to fall outside them.
  2. I have refit the model without the intervention variables and subtracted fit_intervention from fit_nonintervention. The refitted model has fairly similar fitted values but an overall worse fit. (A rough sketch of this comparison is given after this list.)
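
For reference, this is roughly what the second approach looks like in code. This is only a sketch, assuming the df and fit objects built above, and it takes a shortcut: rather than refitting, the counterfactual is obtained by switching off the intervention covariates in the prediction data.

#counterfactual dataset: the intervention never happened
df.cf <- df
df.cf$interv.indicator <- 0
df.cf$post.interv.trend <- 0
df.cf$pre.interv.trend <- 1:120 #let the baseline trend continue
mu.obs <- predict(fit, newdata = df, type = "response")
mu.cf <- predict(fit, newdata = df.cf, type = "response")
#rough monthly effect size post-intervention; note lag1/lag2 are held
#at their observed values, so this is only an approximation
mean(mu.cf[73:120] - mu.obs[73:120])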

Best Answer

In general, evaluation of pre-post effects in time-series analysis is called interrupted time series. This is a very general modeling approach that tests the strong hypothesis:

$\mathcal{H}_0: \mu_{it} = f_i(t)$ versus $\mathcal{H}_1 : \mu_{it} = f_i(t) + \beta(t)X_{it}$

Where $X_{it}$ is the treatment assignment for individual $i$ at time $t$. The easiest example treats $\beta$ as a constant function and $X_{it}$ as a 0/1 indicator (0: pre-intervention, 1: peri- or post-intervention). Even if the actual "effect" of the intervention differs from this, the test is powered to detect differences in many types of scenarios: for instance, if $\beta(t)$ is any non-zero function, then a working constant parameter $\beta$ estimates a time-averaged response to the intervention, which is non-zero whenever $\beta(t)$ does not average out to zero.
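
To make "time-averaged" concrete (a heuristic reading, assuming $f_i(t)$ is correctly specified and $X_{it}$ is the post-intervention indicator), the working constant targets roughly

$\beta \approx \frac{1}{|T_{\text{post}}|}\sum_{t \in T_{\text{post}}} \beta(t)$

the average of the true time-varying effect over the post-intervention times $T_{\text{post}}$; the test therefore has power whenever this average is bounded away from zero.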

A challenge in time-series analysis of pre-post interventions is choosing a parametric modeling approach for the autocorrelation. With many replicates over time, one can decompose the trend into lagged effects, seasonal effects, etc., which obviates the need for autocorrelation in the error term. It is therefore not necessary to use the forecast package: the model itself directly predicts what would have been observed in the post-intervention time period.
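
As a quick check of that claim, one can inspect the autocorrelation of the residuals once the lagged and seasonal terms are in the model. This is my addition, run on the fit object from the question's simulated example:

## residual autocorrelation after lag/seasonal adjustment; spikes beyond
## the bands would suggest the parametric terms are not sufficient
acf(residuals(fit, type = 'deviance'))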

Consider the famous AirPassengers data in the datasets package in R.

## construct an analytic dataset to predict the time trend from
## seasonal (month) and secular (year) components; a lagged value is
## also stored in case auto-regressive terms are wanted
AirPassengers <- data.frame('flights' = as.numeric(AirPassengers))
AirPassengers$month <- factor(rep(month.name, 12), levels = month.name)
AirPassengers$year <- rep(1949:1960, each = 12)
AirPassengers$lag <- c(NA, AirPassengers$flights[-nrow(AirPassengers)])

plot(AirPassengers$flights, type = 'l')

## log-linear fit with month dummies and a linear year trend
AirPassengers$fitted <- exp(predict(lm(log(flights) ~ month + year, data = AirPassengers)))
lines(AirPassengers$fitted, col = 'red')

It's obvious this provides an excellent prediction of the time-based trends. If, however, you were interested in testing whether "flying increased" after, say, 1955, you can update the dataset to include a 0/1 indicator for whether or not the time period is after that point and test its significance in a linear model.

For example:

library(lmtest)
library(sandwich)
## 0/1 indicator for the post-1955 period
AirPassengers$post <- AirPassengers$year >= 1955
fit <- lm(log(flights) ~ month + year + post, data = AirPassengers)
## heteroskedasticity-robust test of the post coefficient
coeftest(fit, vcov. = vcovHC)['postTRUE', ]

Gives me:

> coeftest(fit, vcov. = vcovHC)['postTRUE', ]
  Estimate Std. Error    t value   Pr(>|t|) 
0.03720327 0.01783242 2.08627126 0.03890842 

Which is a nice example of a spurious finding: a statistically significant effect that is not practically significant. A more general test can be had by allowing heterogeneity in the month-specific effects.

nullmodel <- lm(log(flights) ~ month + year, data = AirPassengers)
fullmodel <- lm(log(flights) ~ post * month + year, data = AirPassengers)
## robust Wald test of all post terms (main effect and interactions) at once
waldtest(nullmodel, fullmodel, vcov = vcovHC, test = 'Chisq')

Both of these are examples of the general "interrupted time series" approach to segmented regression. It is a loosely defined term, and I'm a little disappointed by how little detail the authors give about their exact approach in most cases.
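
To connect this back to the question's segmented negative binomial model: since glm.nb uses a log link, the magnitude of the intervention can be read off the fitted coefficients as rate ratios. A sketch, assuming the glm.nb fit from the question (re-run it first, since the AirPassengers example above reuses the name fit); note that with that coding the post-period slope is estimated separately rather than as a deviation, so the trend change is the difference of the two slope coefficients.

## immediate level change at the intervention, as a rate ratio
exp(coef(fit)["interv.indicator"])
## change in monthly trend: post slope relative to pre slope
exp(coef(fit)["post.interv.trend"] - coef(fit)["pre.interv.trend"])
## Wald 95% intervals on the rate-ratio scale
exp(confint.default(fit)[c("interv.indicator", "post.interv.trend"), ])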