Solved – Changepoint/Step Detection in Univariate Time Series

change point, python, time series

As a beginner to time series analysis, I'm trying to understand the best way of detecting the points at which my univariate time series shows a change in trend direction (see highlighted example).

I believe these are known as 'changepoints' and/or 'step changes' (I'm not totally sure if these two terms mean the same thing, and if not, which one of these I'm trying to find)?

I've had a go at some simple window-based thresholding using the first-order difference of the time series, where I check whether the range within the window is greater than a certain percentage of the range across the entire dataset. Whilst this works well on this particular dataset, it is susceptible to noise (tested against other datasets).

I've noted that the spikes in the rolling standard deviation correlate with the observed change points (this makes sense to me), but I'm not sure if/how I could utilise this to produce a more robust detection solution?

Solutions in Python would be preferred, but even just theory suggestions would be appreciated!

Best Answer

Here is some test data with similar properties. The code is in R:

set.seed(1)
a=rep(c(1,5,9,15),each=250)
x=1:1000
y=a+-0.02*x+rnorm(1000,sd=0.4)

To analyze this:

library(EnvCpt)
out=envcpt(y,models="trendcpt")
cpts(out$trendcpt) # gives changes at 250, 500, 750 as simulated.

plot(out$trendcpt)

The envcpt function can fit several models and compare the fits with and without changepoints, which is why we specify models="trendcpt": it then fits only that single model.

This can be run from Python using rpy2 or alternative packages that can call R from Python. Unfortunately we don't have a Python implementation yet.
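For readers who want to experiment without R, here is a rough pure-Python sketch of the same idea: simulate a similar series, then search for the segmentation that minimises the residual sum of squares of a separate straight-line fit per segment plus a penalty per changepoint. This is not the EnvCpt algorithm or its API. The function names, the minimum segment length and the penalty rule are all assumptions made for illustration, and Python's random numbers differ from R's set.seed(1), so the simulated values will not match the R data.

```python
import math
import random
import statistics


def simulate_series(seed=1):
    """Level steps every 250 points on top of a common downward trend."""
    rng = random.Random(seed)
    levels = [1, 5, 9, 15]
    return [levels[i // 250] - 0.02 * (i + 1) + rng.gauss(0, 0.4)
            for i in range(1000)]


def prefix_sums(values):
    out = [0.0]
    for v in values:
        out.append(out[-1] + v)
    return out


def detect_trend_changes(y, min_len=10, penalty=None):
    """Penalised least-squares segmentation with one fitted line per segment.

    Returns the number of observations before each detected change.
    """
    n = len(y)
    x = [float(i) for i in range(n)]
    sx, sy = prefix_sums(x), prefix_sums(y)
    sxx = prefix_sums([a * a for a in x])
    sxy = prefix_sums([a * b for a, b in zip(x, y)])
    syy = prefix_sums([b * b for b in y])

    def cost(s, e):
        # Residual sum of squares of a least-squares line fitted to y[s:e].
        m = e - s
        ax, ay = sx[e] - sx[s], sy[e] - sy[s]
        cxx = sxx[e] - sxx[s] - ax * ax / m
        cxy = sxy[e] - sxy[s] - ax * ay / m
        cyy = syy[e] - syy[s] - ay * ay / m
        return max(cyy - cxy * cxy / cxx, 0.0)

    if penalty is None:
        # Assumed rule of thumb: robust noise variance from first differences,
        # scaled by a deliberately conservative multiple of log(n).
        diffs = [b - a for a, b in zip(y, y[1:])]
        med = statistics.median(diffs)
        mad = statistics.median([abs(d - med) for d in diffs])
        sigma2 = (mad / 0.6745) ** 2 / 2.0
        penalty = 10.0 * sigma2 * math.log(n)

    # best[e] = minimal penalised cost of segmenting y[:e];
    # last[e] = start index of the final segment in that optimum.
    best = [math.inf] * (n + 1)
    best[0] = -penalty
    last = [0] * (n + 1)
    for e in range(min_len, n + 1):
        for s in range(0, e - min_len + 1):
            if best[s] == math.inf:
                continue
            c = best[s] + cost(s, e) + penalty
            if c < best[e]:
                best[e], last[e] = c, s

    # Walk back through the stored segment starts to recover the changepoints.
    cpts, e = [], n
    while last[e] > 0:
        e = last[e]
        cpts.append(e)
    return sorted(cpts)


if __name__ == "__main__":
    print(detect_trend_changes(simulate_series()))
```

The double loop is quadratic in the series length, which is fine for a thousand points but slow for long series. Methods such as PELT prune the inner loop to avoid that cost.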

Related Solutions

Solved – Maximizing Log-Likelihood Estimation for Changepoint Detection

It is not clear from the presentation what distributional assumptions are being made in order to calculate the likelihoods.

It might be simpler for you to look at the recently released BreakoutDetection package from the same authors: https://blog.twitter.com/2014/breakout-detection-in-the-wild.

But if you are more interested in learning about changepoint detection and how to use likelihoods then read on.

Normally you have a time series y, which we assume has n observations. In order to use likelihoods we need to make some assumptions about the distribution that y comes from. It is often assumed that the data come from a Normal distribution (although this isn't always appropriate). Following a distributional assumption you need to decide which parameters of the distribution are allowed to change, e.g. mean, variance, or both.

If all the parameters change then you proceed by splitting your data into two segments, before the change and after the change, and use maximum likelihood to fit the parameters to each segment. In this way it is like a normal analysis where you just have data points and you fit a model to that data. If not all parameters can change then you need to estimate those that don't change from the whole data set (not always an easy task, especially when parameters are linked).

The trick with changepoint analysis is that you don't know where the change is, so you have to calculate the likelihood for each possible changepoint location and take the most likely as the hypothesized changepoint location. It is this location and its likelihood that you then test: you compare the likelihood ratio to a threshold to decide whether the change is significant (c in slide 16).
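As a minimal sketch of the procedure just described, assuming Normal data where both the mean and the variance may change: the code below fits each candidate split by maximum likelihood, keeps the split with the largest likelihood ratio statistic, and leaves the comparison against a threshold to the caller. The function names, the minimum segment length and the threshold value in the example are hypothetical choices, not taken from the slides.

```python
import math
import random


def segment_loglik(n, total, total_sq):
    """Maximised Normal log-likelihood of a segment, given its size,
    sum and sum of squares (mean and variance set to their MLEs)."""
    var = max(total_sq / n - (total / n) ** 2, 1e-12)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def single_changepoint(y, min_len=5):
    """Return (tau, stat): the most likely split, with tau observations
    before the change, and the likelihood ratio statistic
    2 * (loglik with a change at tau - loglik without a change)."""
    n = len(y)
    c1, c2 = [0.0], [0.0]
    for v in y:
        c1.append(c1[-1] + v)
        c2.append(c2[-1] + v * v)

    null = segment_loglik(n, c1[n], c2[n])
    best_tau, best_stat = None, -math.inf
    for tau in range(min_len, n - min_len + 1):
        alt = (segment_loglik(tau, c1[tau], c2[tau])
               + segment_loglik(n - tau, c1[n] - c1[tau], c2[n] - c2[tau]))
        stat = 2 * (alt - null)
        if stat > best_stat:
            best_tau, best_stat = tau, stat
    return best_tau, best_stat


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative data: the mean jumps from 0 to 5 after 100 observations.
    y = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(5, 1) for _ in range(100)]
    tau, stat = single_changepoint(y)
    threshold = 20.0  # hypothetical value playing the role of c
    print(tau, stat, stat > threshold)
```

The minimum segment length keeps the variance estimate away from zero on very short segments, which would otherwise inflate the likelihood artificially.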

Solved – Trend and Breakout detection in time series

There are several solutions to your problem. There are two forms of outliers:

1. Additive outliers (also called pulses)
2. Level shifts (also called breaks in trend)

I'm assuming you need the second, which you call breakout detection. There are a variety of methods and tools that could help with this:

Open Source Software:

There are also two commercial packages that I have used with great success:

1. SAS, using the UCM and ARIMA frameworks
2. SPSS time series outlier detection

It is beyond the scope of one answer to list the pros and cons of these methodologies. I must say that RAD from Netflix and Breakout Detection from Twitter perform worse on your data. What this tells you, in my opinion, is that statisticians have developed elegant methods, like the one in the changepoint package, that easily detect breakpoints in your data. I have also had excellent success using SAS/SPSS.

Below are some of the results from applying all 4 open source packages. Twitter's breakout is the worst: it does not recognize any breaks in your data. Netflix's RAD does point out all your additive outliers/pulses but fails to recognize the level shift around data point ~1351. Both changepoint and breakpoint correctly detect the level shift, at observations 1353 and 1351 respectively. I'll expand my answer in the future. Let us know if this is what you are looking for.

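library("breakpoint")
library("strucchange")       # breakpoints(y ~ 1), used below, is the strucchange function
library("changepoint")
library("RAD")
library("BreakoutDetection") # provides breakout()
library("ggplot2")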

## Get Data

tsdata <- read.csv("mysql.bytes_received.csv", header = TRUE, sep = ",")
tsdata.value <- tsdata[,2]


## Use Breakpoint

bp.tsdata <- breakpoints(tsdata.value ~ 1)
bp.tsdata
breakpoints(bp.tsdata)


## Use Changepoint

ansmean=cpt.mean(tsdata.value)
ansmean
plot(ansmean,cpt.col='blue')


## USE RAD from Netflix

ggplot_AnomalyDetection.rpca(AnomalyDetection.rpca(as.numeric(tsdata.value),frequency = 1)) + ggplot2::theme_grey(base_size = 25)


## USE Breakout from Twitter
res = breakout(tsdata.value, method='multi', plot=TRUE)
res
res$plot

output from changepoint and breakpoint:

> ansmean
Class 'cpt' : Changepoint Object
       ~~   : S4 class containing 12 slots with names
              date version data.set cpttype method test.stat pen.type pen.value minseglen cpts ncpts.max param.est 

Created on  : Wed Mar 30 01:39:30 2016 

summary(.)  :
----------
Created Using changepoint version 2.2.1 
Changepoint type      : Change in mean 
Method of analysis    : AMOC 
Test Statistic  : Normal 
Type of penalty       : MBIC with value, 25.04703 
Minimum Segment Length : 1 
Maximum no. of cpts   : 1 
Changepoint Locations : 1353 
> bp.tsdata

     Optimal 3-segment partition: 

Call:
breakpoints.formula(formula = tsdata.value ~ 1)

Breakpoints at observation number:
1351 2769 

Corresponding to breakdates:
0.3196876 0.6552295

Output from RAD (Netflix) and Breakout detection (Twitter); both fail to recognize breakouts:

Twitter's Breakout detection:

> res
$loc
integer(0)

$time
[1] 7.951

$pval
[1] NA

$plot
