this is my first post. I have an irregular time series that exhibits large shifts in both mean and in the direction of the trend. It looks something like this (though this is far cleaner than reality):
set.seed(1)
x = sort(rep(seq(as.POSIXct('2015/01/01'),as.POSIXct('2015/01/10'),by='day'),100) + runif(1000,0,80000))
y = c(seq(300),seq(90,70,-.2),seq(20,50,.1),seq(238.5,90,-.5)) + runif(1000,50,80)
plot(x,y)
The task is to:
1. accurately partition/segment the data
2. extract the change point indices
3. fit regression separately for each segment
I have tried several routes, including:
a) hierarchical clustering based on dissimilarity matrix
b) function cpt.mean/var/meanvar from package changepoint (does not seem to work well)
c) function 'breakpoints' from package strucchange (slow and often inaccurate)
d) various types of kmeans (inappropriate, I know)
I have also explored some other packages, such as TSclust, urca, and DTW, but these seem better suited to clustering sets of time-series, not individual values within a time series.
Can someone point me in the correct direction? Perhaps I am not considering the appropriate data model?
UPDATE
Thank you all for your very considered responses. I went back to the strucchange package, and after some fiddling, have gotten it to work quite well. I had not initially appreciated the h 'minimal segment size' argument.
Finished product:
Best Answer
It is hard to say what exactly you did with the
strucchange
package which lead to inaccurate results. If I use a regression model with a simple linear time trend (as was used to generate the data),breakpoints()
recovers the underlying structure. This is admittedly not very difficult here ... andbreakpoints()
is not terribly fast because it is written only in R. However, you could use theforeach
plugin to speed it up somewhat on multiple cores (usingdoMC
) or a cluster (usingdoSNOW
).In any case, to analyze your simple artificial data, I create a simple linear time trend auxiliary variable:
Then I try to find the breakpoints, yielding separate segments with different intercepts/slopes:
This shows that both the residual sum of squares and the BIC drop sharply up to 3 breakpoints. The BIC is still somewhat lower for 4 breakpoints so that this would be selected by default. However, 3 breaks would probably be enough here.
The fitted models for 3 (red) and 4 (blue) breaks are visualized by:
The difference between 3 and 4 breaks is very small (albeit an improvement in terms of BIC). Hence, I would probably focus on the 3 breaks here, yielding intercepts/slopes close to the true values:
Unfortunately,
breakpoints()
and its methods does not deal withPOSIXct
time indexes, otherwise the annotation could be even nicer. You can extract the breakpoints, though, and match these to the underlying POSIXct scale:The same could be done for the associated confidence intervals for the breakpoints: