Solved – segmentation of univariate irregular time series

rregressionsegmented regressiontime seriestimeseries-segmentation

this is my first post. I have an irregular time series that exhibits large shifts in both mean and in the direction of the trend. It looks something like this (though this is far cleaner than reality):

set.seed(1)
x = sort(rep(seq(as.POSIXct('2015/01/01'),as.POSIXct('2015/01/10'),by='day'),100) + runif(1000,0,80000))
y = c(seq(300),seq(90,70,-.2),seq(20,50,.1),seq(238.5,90,-.5)) + runif(1000,50,80)
plot(x,y)

The task is to:
1. accurately partition/segment the data
2. extract the change point indices
3. fit regression separately for each segment

I have tried several routes, including:
a) hierarchical clustering based on dissimilarity matrix
b) function cpt.mean/var/meanvar from package changepoint (does not seem to work well)
c) function 'breakpoints' from package strucchange (slow and often inaccurate)
d) various types of kmeans (inappropriate, I know)

I have also explored some other packages, such as TSclust, urca, and DTW, but these seem better suited to clustering sets of time-series, not individual values within a time series.

Can someone point me in the correct direction? Perhaps I am not considering the appropriate data model?

UPDATE
Thank you all for your very considered responses. I went back to the strucchange package, and after some fiddling, have gotten it to work quite well. I had not initially appreciated the h 'minimal segment size' argument.
Finished product:
enter image description here

Best Answer

It is hard to say what exactly you did with the strucchange package which lead to inaccurate results. If I use a regression model with a simple linear time trend (as was used to generate the data), breakpoints() recovers the underlying structure. This is admittedly not very difficult here ... and breakpoints() is not terribly fast because it is written only in R. However, you could use the foreach plugin to speed it up somewhat on multiple cores (using doMC) or a cluster (using doSNOW).

In any case, to analyze your simple artificial data, I create a simple linear time trend auxiliary variable:

tr <- as.numeric(x - x[1], units = "days")

Then I try to find the breakpoints, yielding separate segments with different intercepts/slopes:

library("strucchange")
bp <- breakpoints(y ~ tr, h = 50, breaks = 8)
plot(bp)

breakpoints1

This shows that both the residual sum of squares and the BIC drop sharply up to 3 breakpoints. The BIC is still somewhat lower for 4 breakpoints so that this would be selected by default. However, 3 breaks would probably be enough here.

The fitted models for 3 (red) and 4 (blue) breaks are visualized by:

plot(x, y)
lines(x, fitted(bp, breaks = 3), col = 2, lwd = 2)
lines(x, fitted(bp, breaks = 4), col = 4, lwd = 2)

breakpoints2

The difference between 3 and 4 breaks is very small (albeit an improvement in terms of BIC). Hence, I would probably focus on the 3 breaks here, yielding intercepts/slopes close to the true values:

coef(bp, breaks = 3)
##               (Intercept)        tr
## 0.001 - 0.3      66.96409 100.28218
## 0.301 - 0.401   213.32564 -20.05220
## 0.402 - 0.702    41.83456  10.76780
## 0.703 - 1       641.02803 -48.51261

Unfortunately, breakpoints() and its methods does not deal with POSIXct time indexes, otherwise the annotation could be even nicer. You can extract the breakpoints, though, and match these to the underlying POSIXct scale:

breakpoints(bp, breaks = 3)     
##          Optimal 4-segment partition: 
## 
## Call:
## breakpoints.breakpointsfull(obj = bp, breaks = 3)
## 
## Breakpoints at observation number:
## 300 401 702 
## 
## Corresponding to breakdates:
## 0.3 0.401 0.702 

x[breakpoints(bp, breaks = 3)$breakpoints]
## [1] "2015-01-03 21:58:26 CET" "2015-01-05 00:23:56 CET"
## [3] "2015-01-08 00:49:45 CET"

The same could be done for the associated confidence intervals for the breakpoints:

confint(bp, breaks = 3)
##          Confidence intervals for breakpoints
##          of optimal 4-segment partition: 
## 
## Call:
## confint.breakpointsfull(object = bp, breaks = 3)
## 
## Breakpoints at observation number:
##   2.5 % breakpoints 97.5 %
## 1   299         300    301
## 2   399         401    402
## 3   701         702    703
## 
## Corresponding to breakdates:
##   2.5 % breakpoints 97.5 %
## 1 0.299       0.300  0.301
## 2 0.399       0.401  0.402
## 3 0.701       0.702  0.703
Related Question