Solved – Significance Test for Jaccard Distance

distance-functions, jaccard-similarity

I am looking for a significance test for the Jaccard Distance (JD).

As an example, I have two datasets as follows:

Baseline: $\left| A \cap B \right| = 57;\ \left| A \cup B \right| = 275 \quad \therefore \ JD = 1 - 57/275 = 0.7927$

Evaluation: $\left| A \cap B \right| = 126;\ \left| A \cup B \right| = 433 \quad \therefore \ JD = 1 - 126/433 = 0.7090$

Is there a way to determine whether the JD at evaluation is significantly different from the baseline?

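Or can I simply use the classical z-test of proportions, treating each ratio $\left| A \cap B \right| / \left| A \cup B \right|$ as a proportion? As far as I understand, that test takes "no difference from the baseline" as its null hypothesis and rejects it when the difference is large relative to its standard error.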
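For reference, the z-test of proportions mentioned above can be sketched as follows. This is only an illustration under a strong assumption: each element of the union $A \cup B$ is treated as an independent Bernoulli trial that either falls in the intersection or does not, which may well not hold for real set data. The function names `jaccard_distance` and `two_proportion_z_test` are made up for this example.

```python
import math


def jaccard_distance(intersection, union):
    """Jaccard distance from the sizes of the intersection and the union."""
    return 1.0 - intersection / union


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    ASSUMPTION: the n1 and n2 union elements are independent trials,
    and x1, x2 count how many of them lie in the intersection.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


jd_baseline = jaccard_distance(57, 275)
jd_evaluation = jaccard_distance(126, 433)
z, p_value = two_proportion_z_test(57, 275, 126, 433)

print(f"JD baseline   = {jd_baseline:.4f}")
print(f"JD evaluation = {jd_evaluation:.4f}")
print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```

The test is run on the Jaccard index (the proportion of the union that lies in the intersection); since the distance is one minus the index, the z statistic for the distance has the same magnitude and the opposite sign.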

Best Answer

I found an article that describes the Jaccard index from a probabilistic perspective, written by Real and Vargas in 1996: The Probabilistic Basis of Jaccard's Index of Similarity. A few years later, tables of significance values (see Table 3) were published in: Tables of significant values of Jaccard's index of similarity. These papers describe how to determine whether a single J is significant, so they may not directly answer your question of whether two values differ. However, the statistical approach given in (Real and Vargas, 1996) may be helpful in deriving an appropriate methodology.

By the way, I would recommend against using a z-test: I sometimes obtain significant results even for small differences between means, because the standard deviations are small. To me, the test seems somewhat overoptimistic, and it should not be applied when (as in my case) cross-validation, bootstrapping, or similar resampling approaches are used to assess the stability of estimates, presumably because the resampled estimates are not independent of one another.
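As one possible resampling alternative, here is a minimal sketch of a percentile bootstrap confidence interval for the difference in Jaccard distance. It is not the method from the cited papers. It assumes that element-level membership is available; since the question only gives counts, the indicator lists below are reconstructed from those counts (1 = element of the union that is also in the intersection), which again treats elements as exchangeable and independent. All names (`jd_from_indicators`, `bootstrap_diff_ci`) are hypothetical.

```python
import random


def jd_from_indicators(indicators):
    """Jaccard distance when each union element is flagged 1 if it is
    also in the intersection and 0 otherwise."""
    return 1.0 - sum(indicators) / len(indicators)


def bootstrap_diff_ci(base, evaluation, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for JD(evaluation) - JD(base)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each dataset with replacement, keeping its size.
        b = rng.choices(base, k=len(base))
        e = rng.choices(evaluation, k=len(evaluation))
        diffs.append(jd_from_indicators(e) - jd_from_indicators(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# ASSUMPTION: indicators rebuilt from the counts given in the question.
base = [1] * 57 + [0] * (275 - 57)
evaluation = [1] * 126 + [0] * (433 - 126)

observed = jd_from_indicators(evaluation) - jd_from_indicators(base)
ci_low, ci_high = bootstrap_diff_ci(base, evaluation)
print(f"observed difference in JD = {observed:.4f}")
print(f"95% bootstrap CI = ({ci_low:.4f}, {ci_high:.4f})")
```

If the interval excludes zero, that would suggest a difference under these assumptions. With real data, it would be more faithful to resample the original observations that generate the sets A and B, rather than the reconstructed indicators used here.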