Solved – segmentation of univariate irregular time series

rregressionsegmented regressiontime seriestimeseries-segmentation

this is my first post. I have an irregular time series that exhibits large shifts in both mean and in the direction of the trend. It looks something like this (though this is far cleaner than reality):

set.seed(1)
x = sort(rep(seq(as.POSIXct('2015/01/01'),as.POSIXct('2015/01/10'),by='day'),100) + runif(1000,0,80000))
y = c(seq(300),seq(90,70,-.2),seq(20,50,.1),seq(238.5,90,-.5)) + runif(1000,50,80)
plot(x,y)

The task is to:
1. accurately partition/segment the data
2. extract the change point indices
3. fit regression separately for each segment

I have tried several routes, including:
a) hierarchical clustering based on dissimilarity matrix
b) function cpt.mean/var/meanvar from package changepoint (does not seem to work well)
c) function 'breakpoints' from package strucchange (slow and often inaccurate)
d) various types of kmeans (inappropriate, I know)

I have also explored some other packages, such as TSclust, urca, and DTW, but these seem better suited to clustering sets of time-series, not individual values within a time series.

Can someone point me in the correct direction? Perhaps I am not considering the appropriate data model?

UPDATE
Thank you all for your very considered responses. I went back to the strucchange package, and after some fiddling, have gotten it to work quite well. I had not initially appreciated the h 'minimal segment size' argument.
Finished product:

Best Answer

It is hard to say what exactly you did with the strucchange package which lead to inaccurate results. If I use a regression model with a simple linear time trend (as was used to generate the data), breakpoints() recovers the underlying structure. This is admittedly not very difficult here ... and breakpoints() is not terribly fast because it is written only in R. However, you could use the foreach plugin to speed it up somewhat on multiple cores (using doMC) or a cluster (using doSNOW).

In any case, to analyze your simple artificial data, I create a simple linear time trend auxiliary variable:

tr <- as.numeric(x - x[1], units = "days")

Then I try to find the breakpoints, yielding separate segments with different intercepts/slopes:

library("strucchange")
bp <- breakpoints(y ~ tr, h = 50, breaks = 8)
plot(bp)

This shows that both the residual sum of squares and the BIC drop sharply up to 3 breakpoints. The BIC is still somewhat lower for 4 breakpoints so that this would be selected by default. However, 3 breaks would probably be enough here.

The fitted models for 3 (red) and 4 (blue) breaks are visualized by:

plot(x, y)
lines(x, fitted(bp, breaks = 3), col = 2, lwd = 2)
lines(x, fitted(bp, breaks = 4), col = 4, lwd = 2)

The difference between 3 and 4 breaks is very small (albeit an improvement in terms of BIC). Hence, I would probably focus on the 3 breaks here, yielding intercepts/slopes close to the true values:

coef(bp, breaks = 3)
##               (Intercept)        tr
## 0.001 - 0.3      66.96409 100.28218
## 0.301 - 0.401   213.32564 -20.05220
## 0.402 - 0.702    41.83456  10.76780
## 0.703 - 1       641.02803 -48.51261

Unfortunately, breakpoints() and its methods does not deal with POSIXct time indexes, otherwise the annotation could be even nicer. You can extract the breakpoints, though, and match these to the underlying POSIXct scale:

breakpoints(bp, breaks = 3)     
##          Optimal 4-segment partition: 
## 
## Call:
## breakpoints.breakpointsfull(obj = bp, breaks = 3)
## 
## Breakpoints at observation number:
## 300 401 702 
## 
## Corresponding to breakdates:
## 0.3 0.401 0.702 

x[breakpoints(bp, breaks = 3)$breakpoints]
## [1] "2015-01-03 21:58:26 CET" "2015-01-05 00:23:56 CET"
## [3] "2015-01-08 00:49:45 CET"

The same could be done for the associated confidence intervals for the breakpoints:

confint(bp, breaks = 3)
##          Confidence intervals for breakpoints
##          of optimal 4-segment partition: 
## 
## Call:
## confint.breakpointsfull(object = bp, breaks = 3)
## 
## Breakpoints at observation number:
##   2.5 % breakpoints 97.5 %
## 1   299         300    301
## 2   399         401    402
## 3   701         702    703
## 
## Corresponding to breakdates:
##   2.5 % breakpoints 97.5 %
## 1 0.299       0.300  0.301
## 2 0.399       0.401  0.402
## 3 0.701       0.702  0.703

Related Solutions

Solved – Asynchronous (irregular) Time Series Analysis

I know of one possible solution, but it is sufficiently complicated that I'm going to take the easy option and link you to the relevant academic paper (a critically under-rated paper in my opinion):

Frank de Jong, Theo Nijman (1997) "High Frequency Analysis of Lead-Lag Relationships Between Financial Markets"

I'm sure more work must have been done on this problem since then. A good way to find it is to use the "citations" page on ideas.repec. A link to the relevant page for the above-mentioned paper is here. A few titles look quite relevant.

Solved – Irregular Time Series

Yes, the problem is that auto.arima expects a ts time series - or an object that can be coerced to that. Internally, auto.arima coerces x <- as.ts(x). You can check what this looks like by:

as.ts(tt)
## Time Series:
## Start = 16563 
## End = 16654 
## Frequency = 1 
##  [1] 1494.50      NA 1367.57      NA      NA      NA      NA 1357.57      NA
## [10] 1222.23      NA      NA      NA      NA 1124.02      NA 1011.64      NA
## [19]      NA      NA      NA 4575.64 3201.87 3050.04 2173.38 1967.88 1838.55
## [28] 1666.05      NA 1656.05 1524.96      NA      NA      NA      NA      NA
## [37]  835.96      NA      NA      NA      NA      NA      NA  775.36  592.36
## [46]  494.15      NA      NA      NA 4058.15 2624.36 2448.47      NA      NA
## [55] 1598.47 1398.47 1264.14 1165.88 1053.67      NA      NA      NA      NA
## [64]      NA      NA  941.36      NA      NA      NA      NA  821.36      NA
## [73]  471.36      NA      NA  373.15      NA      NA      NA  259.91      NA
## [82]      NA      NA      NA 3808.91 2262.26 1940.39 1011.39  800.81      NA
## [91]      NA  790.81

Thus, zoo detects the underlying numeric time index corresponding to the Dates (number of days since 1970-01-01), creates a regular grid, and fills it with NAs.

What to do in this situation depends on whether this is your complete data and what the pattern is that you expect to be relevant. For example, one could try to coarsen the data to a regular grid or try to fit a continuous time model etc.

Best Answer

Related Solutions

Solved – Asynchronous (irregular) Time Series Analysis

Solved – Irregular Time Series

Related Question