Solved – Best similarity measure for binary wine data

binary data, categorical data, similarities

I'm creating a dataset of wine grape varieties and their associated flavors/aromas. Here's a schematic of the data:

            Flavor1       Flavor2       Flavor3       Flavor4    ...
   Grape1      1             1             1             0
   Grape2      0             0             0             1
   Grape3      0             0             1             0
   Grape4      1             1             1             1
   ...

1 = grape has the flavor

0 = grape doesn't have the flavor

I plan to plot histograms for each grape variety and do a visual check, but I imagine there's some similarity matrix I could construct for these data. I'm not the most advanced statistics user, so something readily implementable in a statistical package would be great, if at all possible.

Thank you!

Best Answer

I suggest trying one of the following distance measures.

If you are using Python, the following package gives you a list of distance metrics to experiment with out of the box.

Just check out the section "Metrics intended for boolean-valued vector spaces".

Here you can find a short recipe for doing so.
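As a rough illustration of what such a recipe can look like, here is a minimal sketch that builds a grape-by-grape similarity matrix from the schematic data in the question. It assumes SciPy and NumPy are installed and uses `scipy.spatial.distance.pdist`, which is not necessarily the package the answer links to. The Jaccard measure is picked here only as one common choice for presence/absence data; the variable names `grapes`, `X`, `jaccard_dist` and `jaccard_sim` are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows are grapes, columns are flavors (values copied from the schematic
# in the question; real data would have many more columns).
grapes = ["Grape1", "Grape2", "Grape3", "Grape4"]
X = np.array([
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
], dtype=bool)

# Jaccard distance between two grapes:
#   1 - (flavors shared by both) / (flavors present in at least one)
# Flavors that both grapes lack are ignored.
# pdist returns the condensed pairwise distances; squareform expands
# them into a symmetric square matrix with zeros on the diagonal.
jaccard_dist = squareform(pdist(X, metric="jaccard"))

# Turn distances into similarities (1 = identical flavor sets).
jaccard_sim = 1.0 - jaccard_dist

print(grapes)
print(np.round(jaccard_sim, 2))
```

With these four rows, Grape1 and Grape4 share three of the four flavors present in either, so their similarity is 0.75, while Grape1 and Grape2 share none and get 0. Other boolean metrics accepted by `pdist` (for example `"dice"` or `"hamming"`) can be swapped in through the `metric` argument to compare how the choice changes the matrix.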

This thread might be good further reading.
