Solved – Python + Machine Learning : string matching problem

I have been given the following problem to solve:

The company maintains a dataset with the specifications of all the products it sells (nearly 4,500 at present). Each customer shares the details (name, quantity, brand, etc.) of the products he/she wants to buy from the company. While entering these details, the customer may misspell a product name. In addition, a single product can be referred to in many different ways in the company dataset. Example: red chilly can appear as guntur chilly, whole red chilly, red chilly with stem, red chilly without stem, etc.

I am confused about how to approach this problem. Should I use a machine-learning-based technique? If yes, please explain what to do. If the problem can be solved without machine learning, please explain that approach as well. I am using Python.

The challenge: a customer can refer to a product in many ways, and the company also stores a single product in many ways, with variations in name, quantity, unit of measurement, etc. With a labeled dataset I can find out that red bull energy drink (entered by the customer) is red bull (label), and that red bull (entered by the customer) is also red bull. But what is the use of finding this label? Red bull is also present in the company dataset under many names, so I would still have to find all the different names under which red bull appears there.

My approach:
I will prepare a Python dictionary like this:

{
"red chilly" : ['red chilly', 'guntur chilly', 'red chilly with stem'],
"red bull" : ['red bull energy drink', 'red bull']
}

Each entry in the dictionary is a product. The keys are a kind of stem name for the product, and the values are all the possible names of that product. Now suppose a customer enters a product name, say red bull energy drink. I will check the values of each key in the dictionary. If any value of a key matches, I will know that the product is actually red bull and that it can appear in the company dataset as both red bull and red bull energy drink. Is this a reasonable approach?
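A minimal sketch of this lookup, assuming the dictionary above. The names `PRODUCT_ALIASES`, `normalize`, `build_alias_index` and `lookup` are made up for illustration. Instead of scanning every key on each query, it inverts the dictionary once so that each alias points back to its product. Note that it only finds exact matches (after lowercasing and collapsing whitespace), so a misspelled name returns nothing.

```python
# Hypothetical alias table, copied from the example dictionary above.
PRODUCT_ALIASES = {
    "red chilly": ["red chilly", "guntur chilly", "red chilly with stem"],
    "red bull": ["red bull energy drink", "red bull"],
}


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def build_alias_index(aliases_by_product):
    """Invert {product: [aliases]} into {alias: product} for direct lookups."""
    index = {}
    for product, aliases in aliases_by_product.items():
        for alias in aliases:
            index[normalize(alias)] = product
    return index


ALIAS_INDEX = build_alias_index(PRODUCT_ALIASES)


def lookup(customer_name):
    """Return (product key, all known aliases), or None if there is no exact match."""
    product = ALIAS_INDEX.get(normalize(customer_name))
    if product is None:
        return None
    return product, PRODUCT_ALIASES[product]


print(lookup("Red Bull  Energy Drink"))  # ('red bull', ['red bull energy drink', 'red bull'])
print(lookup("red chily"))               # None: the misspelling is not a known alias
```

The second call shows the limitation: a dictionary alone cannot handle spelling mistakes unless every misspelling is listed as an alias.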

Best Answer

I had a similar problem, and what helped me was finding the right keyword to search for.

The keyword: Fuzzy String Matching (Approximate String Matching).

In particular, the Levenshtein distance can be used to measure the similarity of two strings:

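Example: compare red chilly with a misspelled variant of it using Levenshtein distance, e.g. Leven_distance(string1, string2) = 0.8. Note that the raw Levenshtein distance is an integer count of single-character edits (0 for identical strings), so a value like 0.8 is a normalized similarity score derived from that distance, not the distance itself.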
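To make this concrete, here is a rough pure-Python sketch of the Levenshtein distance, plus one common way to turn it into a score between 0 and 1. The function names, the normalization (dividing by the longer string's length), the 0.8 threshold, and the misspelling "red chily" are all assumptions for illustration, not something from the question. In practice a dedicated fuzzy-matching library would likely be faster.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized score in [0, 1]; 1.0 means identical. One possible normalization."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(query, candidates, threshold=0.8):
    """Return (candidate, score) for the most similar candidate, or None below the threshold."""
    scored = [(c, similarity(query, c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1], default=None)
    if best is None or best[1] < threshold:
        return None
    return best


print(levenshtein("red chilly", "red chilly"))  # 0
print(levenshtein("red chilly", "red chily"))   # 1
print(best_match("red chily", ["red chilly", "guntur chilly", "red bull"]))
```

With a function like `best_match`, a misspelled customer entry can be compared against every known product name, and the closest one is accepted only if it is similar enough. The threshold needs tuning on real data: too low and unrelated products match, too high and genuine misspellings are rejected.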
