Solved – Detecting a step change in time ordered data

anovachange pointregressiontime series

Suppose I have data which looks like this:

enter image description here

dat <- data.frame(t = c(rep(0,30), rep(5,30), rep(10,30), rep(15,30), rep(20,30)),
                  y = c(rnorm(30, 5, .5), rnorm(30, 4, .5), rnorm(30, 3, .5), rnorm(30, 2, .5), rnorm(30, 1, .5)))

Which has a sequence of treatments which have a set order (e.g. lab rats exposed to the same chemical at different concentrations or long-distance runners placed in hyperbaric chambers for increasing periods of time). What is the best way to determine whether there is a linear relationship or a step change at a critical point as there is in this second dataset?

enter image description here

dat2 <- dat
change <- which(dat2$t %in% 10)
dat2$y[change] <- dat2$y[change] + 1
change <- which(dat2$t %in% 15)
dat2$y[change] <- dat2$y[change] - 1

How can you tell when the step change is significant with enough confidence to say one way or the other? Or at least put a probability on the two possibilities.

Ideally I'd like only these two possibilities to be possible so that if something else with obvious outliers appeared in the data like this:

enter image description here

dat3 <- dat 
change <- which(dat3$t %in% 0)
dat3$y[change] <- dat3$y[change] - 3

the obvious linear relationship would continue to be detected.

I have seen this question and note that changepoint analysis is probably not appropriate as there are multiple data points per treatment and not many treatments compared to a time series. I guess straightforward ANOVA or regression could be applied to identify a difference but I don't think it would identify a single step change. Although i'm happy to be proved wrong.

Best Answer

The changepoint methods such as the strucchange package assume that the series occur in an ordered manner. Thus even though you have several observations at a single time point, the order in which they occur in your data table matters. Having said this the strucchange package works well on the series above identifying the correct (in some sense) observations. Strucchange doesn't work on data tables so converting to a matrix we get

set.seed(10)
dat <- matrix(c(rep(0,30), rep(5,30), rep(10,30), rep(15,30), rep(20,30), rnorm(30, 5, .5), rnorm(30, 4, .5), rnorm(30, 3, .5), rnorm(30, 2, .5), rnorm(30, 1, .5)),ncol=2)
library(strucchange)
breakpoints(dat[,2]~dat[,1])

No changepoint is identified. Now for dat2:

dat2 <- dat
change <- which(dat2[,1] %in% 10)
dat2[change,2] <- dat2[change,2] + 1
change <- which(dat2[,1] %in% 15)
dat2[change,2] <- dat2[change,2] - 1
breakpoints(dat2[,2]~dat2[,1])

Optimal is 2 changepoints at 30 and 90 (0-30 is the slightly higher mean of 5 in t=0, 31-90 have mean 4 and 91-150 have mean 1).

Now for dat3:

dat3 <- dat 
change <- which(dat3[,1] %in% 0)
dat3[change,2] <- dat3[change,2] - 3
breakpoints(dat3[,2]~dat3[,1])

Here we have 1 changepoint at 59 which suggests that the first two form an upward trend followed by the remaining which form a downward trend.

This behaviour of the incorrect changepoint is due to the fact that only have 1 time observation before the change. Modifying this to 2 we estimate correctly:

dat4=rbind(matrix(c(rep(-5,30),rnorm(30,2,0.5)),ncol=2),dat3)
breakpoints(dat4[,2]~dat4[,1])

1 changepoint at 60.

You are trying to fit a change in the regression relationship. In some sense, repeated observations are not an issue, this gives you are handle on the variability. What is most important is that you have enough information to estimate the relationship prior to the change and also after the change as dat3 and dat4 demonstrate (albeit simply).