Classification Metric – Calculating F1 Score for Change Point Detection

change pointclassificationf1)metric

I'm trying to evaluate my change point detection algorithm based on F1 score[1] defined as follows

Let $\mathcal{X}$ denote the set of change point locations provided by a detection algorithm and let $\mathcal{T_1}$ be the set of (single) human annotations. For a set of ground truth locations $\mathcal{T}$ we define the set of true positives $\operatorname{TP}(\mathcal{T}, \mathcal{X})$ of $\mathcal{X}$ to be those $\tau \in \mathcal{T}$ for which $\exists x \in \mathcal{X}$ such that $|\tau-x| \leq M$, while ensuring that only one $x \in \mathcal{X}$ can be used for a single $\tau \in \mathcal{T}$. The latter requirement is needed to avoid the double counting mentioned previously, and $M \geq 0$ is the allowed margin of error. The precision and recall are then defined as
$$
P =\frac{\left|\mathrm{TP}\left(\mathcal{T_1}, \mathcal{X}\right)\right|}{|\mathcal{X}|}
$$

$$
R = \frac{\left|\operatorname{TP}\left(\mathcal{T_1}, \mathcal{X}\right)\right|}{\left|\mathcal{T_1}\right|}
$$

Now, the question is: how to choose the right value of $x \in \mathcal{X}$ for given $\tau \in \mathcal{T_1}$? There are a few options, such as $x = \text{argmax}(|x – \tau| \leq M)$ or $x = \text{argmin}(|x – \tau| \leq M)$, but let's see an exmaple.

$$
\mathcal{X} = \{2, 3\}
$$

$$
\mathcal{T_1} = \{5, 8\}
$$

If we take $M = 5$ and $x = \text{argmin}(|x – \tau| \leq M)$ then for $\tau_1 = 5$ we take $x = 3$ and for $\tau_2 = 8$ we can't take $x = 2$ as $|2 – 8| > 5$, so $\left|\operatorname{TP}\left(\mathcal{T_1}, \mathcal{X}\right)\right| = 1$

On the other hand, if we take $x = \text{argmax}(|x – \tau| \leq M)$ then for $\tau_1 = 5$ we take $x = 2$ and for $\tau_2 = 8$ we take $x = 3$ and it's perfectly fine, so $\left|\operatorname{TP}\left(\mathcal{T_1}, \mathcal{X}\right)\right| = 2$.

So these both cases differ and thus the precision varies. Is there any approach that would be in some sense the most consistent one? Or maybe it is really up to me which one to choose?

[1] Gerrit J. J. van den Burg, Christopher K. I. Williams. (2022). An Evaluation of Change Point Detection Algorithms, page 5.
https://doi.org/10.48550/arXiv.2003.06222

Best Answer

The issue in changepoint detection is that there is no single measure that assesses all the important properties for changepoint detection. At a minimum you want a measure that assesses closeness to: 1) the correct number of changepoints, 2) the correct locations. Depending on how you are deciding on the changepoints you may also want to look at closeness of parameter estimates within segments and properties of the residuals.

So to answer your question, yes it really is up to you which one to choose. But I would advocate not to choose just one as it can't capture all the important aspects.

Related Question