My data is similar to the following data, but far bigger and more complex.
Apple
Banana
Those fruits
Tomato
Cocumber
These vegetables
I would like to get the following result:
Those fruits
These vegetables
Using the agrep/agrepl function in R, I got a first result. However, agrep and agrepl use the (generalized) Levenshtein distance by default. An alternative would be the Jaccard distance.
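To make the comparison concrete, here is a minimal base-R sketch of both distances on the sample data. The query string "fruits", the choice of character bigrams, and the helper names qgrams and jaccard_dist are assumptions made up for this illustration; only adist is a base R function.

```r
# Levenshtein (edit) distance via base R's adist(), next to a hand-rolled
# Jaccard distance on character bigrams. Helper names are hypothetical.

# Split a string into its set of unique character q-grams (lower-cased).
qgrams <- function(s, q = 2) {
  s <- tolower(s)
  n <- nchar(s)
  if (n < q) return(s)  # too short to split: treat the string as one token
  unique(substring(s, 1:(n - q + 1), q:n))
}

# Jaccard distance = 1 - |intersection| / |union| of the two q-gram sets.
jaccard_dist <- function(a, b, q = 2) {
  A <- qgrams(a, q)
  B <- qgrams(b, q)
  1 - length(intersect(A, B)) / length(union(A, B))
}

# Sample data as given in the question.
x <- c("Apple", "Banana", "Those fruits", "Tomato", "Cocumber", "These vegetables")

# Assumed query string, chosen only for this example.
query <- "fruits"

lev <- adist(query, x, ignore.case = TRUE)[1, ]
jac <- vapply(x, function(s) jaccard_dist(query, s), numeric(1), USE.NAMES = FALSE)

# One row per entry, with both distances side by side.
data.frame(entry = x, levenshtein = lev, jaccard = round(jac, 2))
```

The edit distance counts every extra character in the longer string, while the bigram-based Jaccard distance only looks at which bigrams the two strings share, so the two can rank candidates quite differently when the lengths differ a lot.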
Jaccard distance vs Levenshtein distance: Which distance is better for fuzzy matching?
There is already a similar question: Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients – in sentence matching. However, I would like to know which distance works best for fuzzy matching.
Extra credit: Are other distance measures (e.g. N-Gram, Cosine, Geometric, Manhattan) also useful for fuzzy matching? Implementations in R are also welcome.
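As one example of what such a measure could look like, here is a rough base-R sketch of a cosine distance computed on bigram count vectors. The helper names qgram_counts, counts_on and cosine_dist are hypothetical, and using bigrams as the features is an assumption, not something any particular package prescribes.

```r
# Cosine distance between q-gram count profiles of two strings.
# All helper names here are hypothetical.

# Count how often each character q-gram occurs in a (lower-cased) string.
qgram_counts <- function(s, q = 2) {
  s <- tolower(s)
  n <- nchar(s)
  if (n < q) return(table(s))  # too short to split: one token with count 1
  table(substring(s, 1:(n - q + 1), q:n))
}

# Expand a count table onto a common key set, filling missing keys with 0.
counts_on <- function(tab, keys) {
  v <- setNames(numeric(length(keys)), keys)
  v[names(tab)] <- as.numeric(tab)
  v
}

# Cosine distance = 1 - (a . b) / (|a| * |b|) on the aligned count vectors.
cosine_dist <- function(a, b, q = 2) {
  ca <- qgram_counts(a, q)
  cb <- qgram_counts(b, q)
  keys <- union(names(ca), names(cb))
  va <- counts_on(ca, keys)
  vb <- counts_on(cb, keys)
  1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

cosine_dist("Those fruits", "These vegetables")
```

Unlike the set-based Jaccard distance, this version takes repeated bigrams into account, which may matter for longer strings.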
Best Answer
You can use the Naive Bayes algorithm:
Naive Bayes - Wikipedia