Jaccard Similarity – Assessing Suitability for Non-Binary, Quantitative Data in Python

distance-functions jaccard-similarity python

I have a dataset in which each row is a country and 10 columns hold numerical features such as GDP, electricity consumption, GNI, etc. I am trying to use distance metrics to measure similarity between the countries and ultimately cluster them. I have tried quite a few distance metrics: Euclidean, Minkowski, Canberra, Jaccard, etc. In the case of Jaccard (the implementation in pdist in scipy), I don't think the resulting dissimilarity matrix makes sense, as it contains all 1's apart from the 0's along the diagonal. I read more on Jaccard and it seems to use set union and intersection in the computation. So am I wrong to apply it to continuous variables? I have read a lot on Jaccard and it seems to be useful only when the data are represented as 0/1 (present/absent). Please guide 🙂

Best Answer

Originally, Jaccard similarity is defined for binary data only. However, its idea (as correctly shown by @ping in their answer) can be extended to quantitative (scale) data. Many sources regard Ruzicka similarity as such an equivalent of Jaccard. A screenshot from the documentation of my SPSS macro !PROXQNT (which can be found on my web page, in the "Various proximities" collection):

[image: screenshot from the !PROXQNT documentation]
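As a rough illustration of the Ruzicka idea, here is a minimal Python sketch. It assumes the commonly given definition, sum of element-wise minima divided by sum of element-wise maxima, and non-negative features that have already been rescaled to comparable ranges (otherwise a large-valued column such as GDP would dominate both sums). The function names and the toy matrix `X` are made up for illustration; they are not taken from the macro.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def ruzicka_similarity(x, y):
    """Sum of element-wise minima over sum of element-wise maxima.

    Assumes non-negative inputs of equal length.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.maximum(x, y).sum()
    if denom == 0:
        # Both vectors are all zeros; treating them as identical is a convention.
        return 1.0
    return np.minimum(x, y).sum() / denom


def soergel_distance(x, y):
    """Dissimilarity counterpart: 1 - Ruzicka similarity."""
    return 1.0 - ruzicka_similarity(x, y)


# Hypothetical data: 3 "countries" x 4 non-negative, already rescaled features.
X = np.array([
    [1.0, 0.5, 0.0, 2.0],
    [0.5, 0.5, 1.0, 2.0],
    [1.0, 0.5, 0.0, 2.0],
])

# pdist accepts a callable metric; squareform expands it to a full matrix.
D = squareform(pdist(X, metric=soergel_distance))
print(D)
```

The resulting matrix `D` can be passed to any clustering routine that accepts a precomputed dissimilarity matrix.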

Besides this, one should keep in mind that for binary data, Jaccard similarity = Ruzicka similarity (= 1 - Soergel distance) = Similarity ratio = Ellenberg similarity.

Therefore, by backward logic, Similarity ratio and Ellenberg similarity can also be considered as candidate quantitative equivalents of Jaccard.
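The binary-data equality can be checked numerically. The sketch below is only an illustration: it assumes the Similarity ratio is the formula usually given under that name (also called Tanimoto or Kohonen similarity), Σxy / (Σx² + Σy² − Σxy), and it leaves Ellenberg similarity out. All function names are hypothetical.

```python
import numpy as np


def jaccard_binary(x, y):
    """Classic Jaccard: |intersection| / |union| of the nonzero positions."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    union = np.logical_or(x, y).sum()
    return np.logical_and(x, y).sum() / union if union else 1.0


def ruzicka(x, y):
    """Sum of minima over sum of maxima (non-negative inputs assumed)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.maximum(x, y).sum()
    return np.minimum(x, y).sum() / denom if denom else 1.0


def similarity_ratio(x, y):
    """Assumed formula: sum(xy) / (sum(x^2) + sum(y^2) - sum(xy))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xy = np.dot(x, y)
    denom = np.dot(x, x) + np.dot(y, y) - xy
    return xy / denom if denom else 1.0


# On random 0/1 vectors the three measures should agree.
rng = np.random.default_rng(0)
all_equal = True
for _ in range(200):
    a = rng.integers(0, 2, size=10)
    b = rng.integers(0, 2, size=10)
    j = jaccard_binary(a, b)
    if not (np.isclose(j, ruzicka(a, b)) and np.isclose(j, similarity_ratio(a, b))):
        all_equal = False
print("equal on binary data:", all_equal)

# On quantitative data they are different generalizations and need not agree.
u = np.array([2.0, 0.0, 1.0])
v = np.array([1.0, 1.0, 1.0])
print(ruzicka(u, v), similarity_ratio(u, v))
```

Because the candidates coincide only in the 0/1 case, the choice between them on scale data is a modelling decision, not something the binary definition settles.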
