Solved – Bayesian online changepoint detection (marginal predictive distribution)

bayesian, change point, inference, time series

I am reading the Bayesian online changepoint detection paper by Adams and MacKay (link).

The authors start by writing the marginal predictive distribution:
$$
P(x_{t+1} | \textbf{x}_{1:t}) = \sum_{r_t} P(x_{t+1} | r_t, \textbf{x}_t^{(r)}) P(r_t | \textbf{x}_{1:t}) \qquad \qquad (1)
$$
where

  • $x_t$ is the observation at time $t$;
  • $\textbf{x}_{1:t}$ denotes the set of observations up to time $t$;
  • $r_t \in \mathbb{N}$ is the current run length (time since the last changepoint, which can be 0); and
  • $\textbf{x}_t^{(r)}$ is the set of observations associated with run $r_t$.

Eq. 1 is formally correct (see the reply below by @JuhoKokkala), but my understanding is that if you want to actually make a prediction about $x_{t+1}$ you'd need to expand it as follows:

$$
P(x_{t+1} | \textbf{x}_{1:t}) = \sum_{r_t, r_{t+1}} P(x_{t+1} | r_{t+1}, \textbf{x}_t^{(r)}) P(r_t | \textbf{x}_{1:t}) P(r_{t+1} | r_t) \qquad (1\text{b})
$$

My reasoning is that there might well be a changepoint at (future) time $t+1$, but the posterior $P(r_t | \textbf{x}_{1:t})$ only covers until $t$.

The point is, the authors in the paper make use of Eq. 1 as is (see Eqs. 3 and 11 in the paper), and not 1b. So, they seemingly ignore the possibility of a changepoint at time $t+1$ when predicting $x_{t+1}$ from data available at time $t$. At the beginning of Section 2 they say in passing

We assume that we can compute the predictive distribution [for $x_{t+1}$] conditional on a given run length $r_t$.

which perhaps is where the trick lies. But in general, this predictive distribution should look something like Eq. 1b, which is not what they use (their Eq. 11).

So, I am not sure I understand what's going on. Perhaps there is something funny going on with the notation.
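
To make my confusion concrete, this is how I was reading the two equations numerically. It is only a sketch, not the paper's algorithm: I assume a Gaussian observation model with known variance (ignoring parameter uncertainty in the predictive for simplicity), a constant hazard `H`, and made-up numbers for the run-length posterior and the per-run predictive means.

```python
import numpy as np
from scipy.stats import norm

# Made-up state at time t: run-length posterior P(r_t | x_{1:t}) for r_t = 0..3,
# and the predictive mean of the per-run Gaussian for each run length.
p_rt = np.array([0.1, 0.2, 0.3, 0.4])        # P(r_t | x_{1:t})
run_means = np.array([0.0, 0.5, 1.0, 1.5])   # E[x_{t+1} | r_t, x_t^{(r)}], per run length
obs_sd = 1.0                                  # known observation noise
prior_mean = 0.0                              # predictive mean right after a changepoint
H = 0.05                                      # constant hazard P(changepoint at t+1)
x_new = 0.7                                   # candidate value of x_{t+1}

# Eq. 1 as I was reading it: mix the within-run predictives with P(r_t | x_{1:t}),
# with no allowance for a changepoint at t+1.
pred_eq1 = np.sum(norm.pdf(x_new, loc=run_means, scale=obs_sd) * p_rt)

# Eq. 1b: additionally mix over r_{t+1} with P(r_{t+1} | r_t): with probability
# 1 - H the run continues (within-run predictive), with probability H a
# changepoint occurs and we fall back to the prior predictive.
pred_growth = norm.pdf(x_new, loc=run_means, scale=obs_sd)   # r_{t+1} = r_t + 1
pred_reset = norm.pdf(x_new, loc=prior_mean, scale=obs_sd)   # r_{t+1} = 0
pred_eq1b = np.sum(((1 - H) * pred_growth + H * pred_reset) * p_rt)

print(pred_eq1, pred_eq1b)   # under this reading of Eq. 1 the two numbers differ
```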


Reference

  • Adams, R. P., & MacKay, D. J. (2007). Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742.

Best Answer

Both (1) and (1b) are correct. The OP has it right that (in this model) there might be a changepoint at $t+1$, and $x_{t+1}$ depends on whether there is one. This does not imply any problem with (1), as the possible values of $r_{t+1}$ are fully "covered" by $P(x_{t+1} \mid r_t, x_{1:t})$: this term is the distribution of $x_{t+1}$ conditional on $(r_t, x_{1:t})$, and such a conditional distribution averages over "everything else", including $r_{t+1}$, conditional on $(r_t, x_{1:t})$. Just as one could write, say, $P(x_{t+1000} \mid x_t)$, which would take into account all possible configurations of changepoints, as well as values of the $x_i$s, occurring between $t$ and $t+1000$.

In the remainder, I first derive (1) and then (1b) based on (1).

Derivation of (1)

For any random variables $A,B,C$, we have \begin{equation} P(A \mid B) = \sum_c P(A \mid B, C=c)\,P(C=c \mid B), \end{equation} as long as $C$ is discrete (otherwise the sum needs to be replaced by an integral). Applying this to $x_{t+1},x_{1:t},r_t$:

\begin{equation} P(x_{t+1} \mid x_{1:t}) = \sum_{r_t} P(x_{t+1} \mid r_t, x_{1:t})\,P(r_t \mid x_{1:t}), \end{equation} which holds no matter what the dependencies between $r_t$, $x_{1:t}$, $x_{t+1}$ are, that is, no model assumptions have yet been used. In the present model, $x_{t+1}$ given $r_t,x^{(r)}_t$ is assumed* to be conditionally independent of the values of $x$ from the runs before $x^{(r)}_t$. This implies $P(x_{t+1} \mid r_t, x_{1:t}) = P(x_{t+1} \mid r_t, x^{(r)}_t)$. Substituting this into the previous equation, we get

\begin{equation} P(x_{t+1} \mid x_{1:t}) = \sum_{r_t} P(x_{t+1} \mid r_t, x^{(r)}_t)\,P(r_t \mid x_{1:t}), \qquad \qquad \qquad (1) \end{equation} which is (1) in OP.
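
If it helps, the total-probability decomposition used above can be checked numerically on an arbitrary discrete joint distribution. Nothing in this small sketch is specific to the changepoint model; the joint is just random made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint P(A, B, C) over small finite supports.
joint = rng.random((3, 4, 5))
joint /= joint.sum()

# Left-hand side, P(A | B): marginalize C out, then normalize over A.
p_ab = joint.sum(axis=2)              # P(A, B)
p_b = p_ab.sum(axis=0)                # P(B)
lhs = p_ab / p_b                      # P(A | B), shape (A, B)

# Right-hand side: sum_c P(A | B, C=c) P(C=c | B).
p_bc = joint.sum(axis=0)              # P(B, C)
p_a_given_bc = joint / p_bc           # P(A | B, C)
p_c_given_b = p_bc / p_b[:, None]     # P(C | B)
rhs = np.einsum('abc,bc->ab', p_a_given_bc, p_c_given_b)

print(np.allclose(lhs, rhs))          # True: the two sides agree
```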

Derivation of (1b)

Let us consider the decomposition of $P(x_{t+1} \mid r_t, x^{(r)}_t)$ over possible values of $r_{t+1}$: \begin{equation} P(x_{t+1} \mid r_t, x^{(r)}_t) = \sum_{r_{t+1}} P(x_{t+1} \mid r_{t+1}, r_t, x^{(r)}_t)P(r_{t+1} \mid r_t, x^{(r)}_t). \end{equation}

Since it is assumed* that whether a changepoint occurs at $t+1$ (between $x_t$ and $x_{t+1}$) does not depend on the history of $x$, we have $P(r_{t+1} \mid r_t, x^{(r)}_t) = P(r_{t+1} \mid r_t)$. Furthermore, since $r_{t+1}$ determines whether $x_{t+1}$ belongs to the same run as $x_t$, we have $P(x_{t+1} \mid r_{t+1}, r_t, x^{(r)}_t)=P(x_{t+1} \mid r_{t+1}, x^{(r)}_t)$. Substituting these two simplifications into the factorization above, we get \begin{equation} P(x_{t+1} \mid r_t, x^{(r)}_t) = \sum_{r_{t+1}} P(x_{t+1} \mid r_{t+1}, x^{(r)}_t)P(r_{t+1} \mid r_t). \end{equation} Substituting this into (1), we get \begin{equation} P(x_{t+1} \mid x_{1:t}) = \sum_{r_t} \left(\sum_{r_{t+1}} P(x_{t+1} \mid r_{t+1}, x^{(r)}_t)P(r_{t+1} \mid r_t)\right)\,P(r_t \mid x_{1:t}), \qquad (1b) \end{equation} which is OP's (1b).
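
For concreteness, here is a numerical illustration that (1) and (1b) indeed give the same number, and that the conditional in (1) already averages over $r_{t+1}$. It uses a toy model of my own choosing (Bernoulli observations, a Beta prior reset at each changepoint, a constant hazard), not the paper's exact setup or indexing conventions, and everything is computed by brute-force enumeration, so only the toy model's own assumptions are used.

```python
import numpy as np
from itertools import product

# Toy model (a stand-in for the paper's setting, with conventions chosen here):
#   - observations x_i in {0, 1};
#   - before each x_i (i >= 2) a changepoint occurs with constant hazard H,
#     independently of the data;
#   - within a run, x_i ~ Bernoulli(theta), with theta ~ Beta(a, b) drawn fresh
#     at each changepoint and integrated out analytically;
#   - r_t = number of earlier observations in x_t's run, so the run's data
#     x_t^{(r)} are the last r_t + 1 observations.

H, a, b = 0.2, 1.0, 1.0          # hazard and Beta prior (made-up values)
x_obs = [1, 1, 0, 1]             # observed x_{1:t}
t = len(x_obs)
x_next = 1                       # evaluate the predictive at x_{t+1} = 1

def run_predictive(run_data, x):
    """Beta-Bernoulli posterior predictive P(x | data of the current run)."""
    n1, n0 = sum(run_data), len(run_data) - sum(run_data)
    p1 = (a + n1) / (a + b + n1 + n0)
    return p1 if x == 1 else 1.0 - p1

def seq_prob(xs, cps):
    """Joint probability of (x_1..x_n, changepoint pattern); cps[i] = 1 means a
    changepoint just before xs[i]."""
    prob, run = 1.0, []
    for i, x in enumerate(xs):
        if i > 0:
            prob *= H if cps[i] else 1.0 - H
            if cps[i]:
                run = []
        prob *= run_predictive(run, x)
        run.append(x)
    return prob

# Brute force: enumerate every changepoint pattern for x_{2:t+1} and both values
# of x_{t+1}, accumulating the joint P(x_{t+1}, r_t, x_{1:t}).
joint = {}
for v in (0, 1):
    for cps in product((0, 1), repeat=t):
        cps = (0,) + cps                       # x_1 always starts a run
        r = 0
        for i in range(1, t):                  # run length at time t
            r = 0 if cps[i] else r + 1
        joint[(v, r)] = joint.get((v, r), 0.0) + seq_prob(x_obs + [v], cps)

p_x = sum(joint.values())                                           # P(x_{1:t})
brute = sum(joint[(x_next, r)] for r in range(t)) / p_x             # P(x_{t+1} | x_{1:t})
p_r = [sum(joint[(v, r)] for v in (0, 1)) / p_x for r in range(t)]  # P(r_t | x_{1:t})

# The conditional appearing in (1), read off the brute-force joint; by the
# argument above it already averages over r_{t+1}.
cond = [joint[(x_next, r)] / (joint[(0, r)] + joint[(1, r)]) for r in range(t)]
eq1 = sum(c * w for c, w in zip(cond, p_r))

# (1b): the same averaging written out explicitly over r_{t+1}, using only the
# hazard and the Beta-Bernoulli predictives.
eq1b = sum(((1 - H) * run_predictive(x_obs[-(r + 1):], x_next)      # r_{t+1} = r_t + 1
            + H * run_predictive([], x_next))                       # r_{t+1} = 0
           * p_r[r] for r in range(t))

print(brute, eq1, eq1b)          # all three agree

# The key point checked directly: P(x_{t+1} | r_t, x_t^{(r)}) equals
# (1 - H) * (run's posterior predictive) + H * (prior predictive).
mix = [(1 - H) * run_predictive(x_obs[-(r + 1):], x_next)
       + H * run_predictive([], x_next) for r in range(t)]
print(np.allclose(cond, mix))    # True
```

The last check is the answer in miniature: the conditional in (1) already contains the $(1-H)$ vs. $H$ split over "run continues" vs. "changepoint at $t+1$", so expanding it into (1b) changes nothing.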

* Remark on the model's conditional independence assumptions

Based on quickly browsing the paper, I would personally like the conditional independence properties to be stated more explicitly somewhere, but I suppose the intention is that $r$ is Markovian and that the $x$'s associated with different runs are independent (given the runs).
