Solved – Jaccard distance vs Levenshtein distance for fuzzy matching

Tags: distance-functions, jaccard-similarity, r, similarities, text mining

My data is similar to the following data, but far bigger and more complex.

Apple
Banana
Those fruits
Tomato
Cucumber
These vegetables

I would like to get the following result:

Those fruits
These vegetables

Using the agrep/agrepl functions in R, I obtained a first result. However, agrep and agrepl are based on the (generalized) Levenshtein edit distance. An alternative would be the Jaccard distance.
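To make the difference between the two distances concrete, here is a minimal sketch in plain Python (chosen only to keep it dependency-free; it is not the asker's R code). It assumes the Jaccard distance is computed on sets of character bigrams, which is one common choice but not the only one. The function names `levenshtein`, `qgrams` and `jaccard_distance` are illustrative, not taken from any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams of s (assumption: q = 2)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """1 - |A & B| / |A | B| over the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 0.0  # both strings too short to have q-grams
    return 1.0 - len(ga & gb) / len(ga | gb)


# Compare a one-letter change with a word reordering.
for a, b in [("Those fruits", "These fruits"), ("Those fruits", "fruits Those")]:
    print(f"{a!r} vs {b!r}: "
          f"levenshtein={levenshtein(a, b)}, "
          f"jaccard={jaccard_distance(a, b):.3f}")
```

The Levenshtein distance depends on character positions, so swapping the two words costs many edits. The bigram-set Jaccard distance ignores position and repetition, so it changes little under reordering. Which behaviour is preferable depends on whether word order matters in the data being matched.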

Jaccard distance vs Levenshtein distance: Which distance is better for fuzzy matching?

There is already a similar question:
Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients – in sentence matching. However, I would like to know which distance works best for fuzzy matching.

Extra credit: Are other distance measures (e.g. N-Gram, Cosine, Geometric, Manhattan) also useful for fuzzy matching? Implementations in R are also welcome.
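For reference, two of the measures named above can be sketched on character q-gram counts. This is a hedged illustration in plain Python, not an R implementation. It assumes bigrams (q = 2) and treats each string as a vector of q-gram frequencies. The helper names `qgram_counts`, `cosine_distance` and `manhattan_distance` are hypothetical.

```python
import math
from collections import Counter


def qgram_counts(s: str, q: int = 2) -> Counter:
    """Frequency of each overlapping character q-gram in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def cosine_distance(a: str, b: str, q: int = 2) -> float:
    """1 - cosine similarity of the two q-gram count vectors."""
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        # No q-grams in at least one string: identical only if both are empty.
        return 0.0 if norm_a == norm_b else 1.0
    return 1.0 - dot / (norm_a * norm_b)


def manhattan_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the q-gram counts."""
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    # Counter returns 0 for q-grams missing from one of the strings.
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


print(cosine_distance("Those fruits", "These fruits"))
print(manhattan_distance("Those fruits", "These fruits"))
```

Unlike the set-based Jaccard distance, both measures take repeated q-grams into account. The Manhattan variant is not normalized, so it grows with string length.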

Best Answer

You can use the Naive Bayes algorithm:

Naive Bayes - Wikipedia
