Solved – Compare several binary time series

binary data, similarities, time series

What is the best method to compare several binary time series, one that takes into account not only the overall number of overlapping points (as the Hamming distance does) but also captures, in some way, similarity in their behaviour?

E.g. given three time series

A = [0 1 1 1 1 0 0 0 1 1 1 1 1 0];
B = [0 0 1 1 0 0 0 0 0 1 1 0 0 0];
C = [0 0 0 0 0 0 0 0 1 1 1 1 1 0];

Using the Hamming distance and similar metrics, A and C come out as the most similar. In terms of behaviour, however, A and B are the more alike pair. Is there a metric that can capture that?
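For reference, this ranking can be checked directly. The sketch below uses Python with SciPy (an assumption, since no language is specified) and the three example series re-typed as boolean arrays:

# Quick check of the pointwise metrics on the example series
# (Python/SciPy used purely for illustration).
import numpy as np
from scipy.spatial.distance import hamming, jaccard

A = np.array([0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0], dtype=bool)
B = np.array([0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0], dtype=bool)
C = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0], dtype=bool)

# Both functions return dissimilarities: the fraction of disagreeing positions
# (Hamming) and the Jaccard dissimilarity of the two sets of 1s.
print("A vs B:", hamming(A, B), jaccard(A, B))  # 5/14 ≈ 0.36, 5/9 ≈ 0.56
print("A vs C:", hamming(A, C), jaccard(A, C))  # 4/14 ≈ 0.29, 4/9 ≈ 0.44

Both metrics rank A and C as the closer pair, even though A and B rise and fall together.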

Best Answer

As others have already noticed in the comments, to get a meaningful comparison of "similarities", you first need to decide what kind of similarity you are looking for; you would also need to tell us more about the nature of your data. The most basic kind of comparison is to compare the individual time points between the series using some distance metric, such as the Jaccard distance or the Hamming distance that you already mentioned. This, however, does not take changes in "trends" into account.

Since you didn't tell us much about the nature of your data, I'll keep my answer general. Basically, your series can be thought of as the result of observing $i = 1,2,\dots,n$ non-identically distributed and possibly dependent Bernoulli random variables

$$ Y_i \sim \mathcal{B}(\pi_i) $$

where $\pi_i$ is a probability of success that changes over time and may depend on external factors, be auto-correlated, etc. Recall that if you had a sample of i.i.d. Bernoulli random variables, the sample mean would be the maximum likelihood estimator of the $\pi$ parameter. Here we are talking about a different $\pi_i$ for each $Y_i$, so this doesn't help us much, until we realize that assuming temporal dependence between the values also means assuming that time points close to each other take similar values. This suggests estimating $\pi_i$ with a moving average over a pre-defined window of width $2h+1$:

$$ \hat m_i = (2h+1)^{-1} \sum_{j=-h}^h y_{i+j} $$

Next, you can treat $\hat m_i$ as a rough approximation of $\pi_i$ changing over time. Since the values of $\hat m_i$ are continuous, you can use standard methods for comparing continuous series (for example, simple correlation). The $h$ parameter controls the smoothness and the "speed" of changes in the series: $h = (n-1)/2$ means that you basically assume your variables to be i.i.d., while $h=1$ means that you look only at very "local" changes in the series.
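As a rough sketch of this idea (Python with NumPy assumed; the half-width h = 2 is an arbitrary choice for illustration):

import numpy as np

def moving_average(x, h):
    # Centred moving average with window width 2h+1 (windows are truncated at the edges).
    x = np.asarray(x, dtype=float)
    return np.array([x[max(i - h, 0):i + h + 1].mean() for i in range(len(x))])

A = [0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0]
B = [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
C = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

h = 2  # window width 2h+1 = 5; should reflect how quickly you expect pi_i to change
mA, mB, mC = (moving_average(s, h) for s in (A, B, C))

# Compare the smoothed "success probabilities" with ordinary correlation.
print("corr(A, B):", np.corrcoef(mA, mB)[0, 1])
print("corr(A, C):", np.corrcoef(mA, mC)[0, 1])

For the example series and h = 2, the smoothed versions of A and B come out more strongly correlated than those of A and C, which matches the intuition in the question.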

Another approach would be to use changepoint analysis, which can be applied to binary data by assuming a Bernoulli likelihood, to detect "blocks" of similar values in each series; you could then look at the overlap of the detected blocks.
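A minimal sketch of that idea, assuming nothing beyond NumPy: greedy binary segmentation that keeps splitting a segment as long as the split improves the maximized Bernoulli log-likelihood by more than a threshold. The threshold here is an ad hoc choice for this toy example; a real analysis would use a principled penalty (e.g. BIC) or a dedicated changepoint package.

import numpy as np

def bernoulli_loglik(x):
    # Maximized Bernoulli log-likelihood of a segment (0*log(0) treated as 0).
    n, k = len(x), x.sum()
    p = k / n
    if p in (0.0, 1.0):
        return 0.0
    return k * np.log(p) + (n - k) * np.log(1 - p)

def best_split(x, min_size=2):
    # Single split that most improves the total log-likelihood of the segment.
    base = bernoulli_loglik(x)
    best_gain, best_i = 0.0, None
    for i in range(min_size, len(x) - min_size + 1):
        gain = bernoulli_loglik(x[:i]) + bernoulli_loglik(x[i:]) - base
        if gain > best_gain:
            best_gain, best_i = gain, i
    return best_gain, best_i

def changepoints(x, min_gain=0.8, min_size=2):
    # Greedy binary segmentation: split recursively while the gain exceeds min_gain.
    x = np.asarray(x, dtype=float)
    gain, i = best_split(x, min_size)
    if i is None or gain < min_gain:
        return []
    left = changepoints(x[:i], min_gain, min_size)
    right = [i + c for c in changepoints(x[i:], min_gain, min_size)]
    return left + [i] + right

A = [0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0]
B = [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
C = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

for name, s in [("A", A), ("B", B), ("C", C)]:
    print(name, changepoints(s))  # indices where the estimated success probability shifts

The detected indices mark the boundaries of the estimated blocks; comparing how well these blocks overlap across the series is then straightforward.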

If you know something more about your series and you have some explanatory variables, then you could use a logistic regression model, specifically a hierarchical one that models your series $Y_{1i}, Y_{2i}, \dots$ jointly:

$$ \pi_{ji} = \mathrm{logit}^{-1}(\boldsymbol{X}_{ji}\beta) \\ Y_{ji} \sim \mathcal{B}(\pi_{ji}) $$

where $\boldsymbol{X}_{ji}$ is a vector of explanatory variables (e.g. the time points $t=1,2,\dots,n$ for a linear trend, or indicators of seasonality), including dummy variables coding series membership ($j=0,1,2,\dots$) and their interactions with the other variables (this depends heavily on the nature of your data!). If the series were "the same", the effects of the dummy variables and of their interactions with the other variables would be close to zero and non-significant. Large effects for interactions with the membership dummies would tell you what the differences between the series are (e.g. if a dummy interacts with seasonality, the difference possibly lies in seasonality).
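For a sense of what this looks like in practice, here is a minimal sketch in Python using statsmodels (assumed available), with only a time index standing in for the explanatory variables; in a real application $\boldsymbol{X}_{ji}$ would contain seasonality indicators and covariates, and a proper mixed-effects (hierarchical) model may be preferable to the pooled fixed-effects version shown here.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Example data; "series" codes membership, "t" is a simple time index.
A = [0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0]
B = [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
C = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
n = len(A)

df = pd.DataFrame({
    "y": A + B + C,
    "t": np.tile(np.arange(n), 3),
    "series": np.repeat(["A", "B", "C"], n),
})

# Series membership enters through dummy variables and their interaction with
# the trend; large, significant interaction terms indicate how the series differ.
fit = smf.logit("y ~ C(series) * t", data=df).fit()
print(fit.summary())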
