Solved – Finding multiple topics in short texts

machine-learning, text-mining

I am building a natural language understanding component for a dialog system. The input is a natural language sentence (usually a short one), and the output should be a set of zero or more classes from a pre-specified set. For example:

input: "I offer a salary of 20000 per month for work as a programmer with a company car"
output: ["OFFER(Salary=20000)", "OFFER(Job=programmer)", "OFFER(Car=yes)"]

I collected some examples from real dialogs, in order to train a classifier. My current approach is:

  • Features: I take the words of each sentence as features (I also tried word bi-grams and letter n-grams but the performance was worse).
  • Training: I train a separate binary classifier for each class. As positive examples, I take all sentences that have this class, and as negative examples, I take all sentences that don't have this class.
  • Running: I run each of the binary classifiers on the new sample, and return the set of classes that correspond to the classifiers that answered 'yes' (see the sketch after this list).
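
To make this concrete, here is a minimal sketch of this one-vs-rest setup using scikit-learn. Winnow is not available in scikit-learn, so a Perceptron stands in for it here; the sentences and labels are illustrative, not my real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

sentences = [
    "I offer a salary of 20000 per month for work as a programmer with a company car",
    "I offer work as a secretary",
]
labels = [
    ["OFFER(Salary=20000)", "OFFER(Job=programmer)", "OFFER(Car=yes)"],
    ["OFFER(Job=secretary)"],
]

mlb = MultiLabelBinarizer()              # one binary column per class
y = mlb.fit_transform(labels)

clf = make_pipeline(
    CountVectorizer(),                   # unigram bag-of-words features
    OneVsRestClassifier(Perceptron()),   # one binary classifier per class
)
clf.fit(sentences, y)

predicted = clf.predict(["I offer a company car"])
print(mlb.inverse_transform(predicted))  # classes whose classifier said 'yes'
```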

This approach led to too many false positives. Since my dataset contains many sentences with many classes (about 6 classes per sentence), the classifier gets confused between words that belong to different classes. For example, it takes the words "work as a programmer" as a positive signal for the class "OFFER(Salary=20000)", because they appeared together in many training samples.

So, I edited my training data and split each sample sentence into several sub-sentences, such that each sub-sentence matches exactly one class:

input: "I offer a salary of 20000 per month"
output: ["OFFER(Salary=20000)"]
input: "for work as a programmer"
output: ["OFFER(Job=programmer)"]
input: "with a company car"
output: ["OFFER(Car=yes)"]

The new approach led to too many false negatives: since every sentence is a negative example for all classes except its own, the classifier takes the words "a salary of 20000" as a negative signal for the class "OFFER(Job=programmer)". Therefore, for compound sentences such as "I offer a salary of 20000 per month for work as a programmer with a company car", the classifier finds a single class and misses all the others.

I also created a mixed dataset – half of the single-class samples and half of the multi-class samples (the other halves were used for testing) – and got a much better result. However, this approach seems like a hack, and I would like to know if there is a more principled approach.

Any suggestions for a solution will be welcome.

NOTE: I use the Winnow classification algorithm, which currently has the best score (I also tried Naive Bayes, SVM, a neural network, and a perceptron).
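
For reference, here is a minimal sketch of the classic Winnow update rule, assuming binary word-presence features; alpha=2 and a threshold equal to the number of features follow Littlestone's original formulation, and the demotion step divides the weights (Winnow2-style) rather than zeroing them:

```python
class Winnow:
    def __init__(self, n_features, alpha=2.0):
        self.w = [1.0] * n_features      # all weights start at 1
        self.theta = float(n_features)   # threshold = number of features
        self.alpha = alpha               # promotion/demotion factor

    def predict(self, x):
        # x is a binary feature vector (word present = 1, absent = 0)
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score >= self.theta else 0

    def update(self, x, y):
        y_hat = self.predict(x)
        if y_hat == y:
            return                       # mistake-driven: no change
        # Promote active weights on a false negative,
        # demote them on a false positive.
        factor = self.alpha if y == 1 else 1.0 / self.alpha
        self.w = [wi * factor if xi else wi
                  for wi, xi in zip(self.w, x)]
```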

Some numbers:

  • Performance of classifier trained on multi-class sentences and tested on single-class sentences: Precision=69% Recall=97% F1=81%
  • Performance of classifier trained on single-class sentences and tested on multi-class sentences: Precision=99% Recall=32% F1=48%
  • Performance of classifier trained on a mixed dataset (1/2 single-class and 1/2 multi-class), and tested on the other half: Precision=88% Recall=89% F1=89%
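
(F1 here is the harmonic mean of precision and recall, e.g. for the first row:)

```python
p, r = 0.69, 0.97
print(round(2 * p * r / (p + r), 2))  # 0.81
```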

Best Answer

The approach of training the classifier on single-class "sentences" (actually, not necessarily sentences, but phrases) is promising, but it breaks down because you are training the classifier on a different type of data than the data you run it on.

You need a pre-processing step that breaks the sentences down into phrases, and you should use that same phrase-level data for both training and actual classification. Then, in your reporting step, you can aggregate the positive phrase-level results up to the level of the input sentence: each sentence is a list of phrases, each with zero or more positively identified classes, and you simply combine all of the phrase-level positives.
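
Here is a sketch of what that pipeline could look like. The phrase splitter is deliberately naive (it splits on a few connective words; a real system would use a chunker or a parser), and clf and mlb are assumed to be a phrase-level classifier and label binarizer like the ones sketched in the question:

```python
import re

def split_into_phrases(sentence):
    # Hypothetical splitter: break on a few connective words.
    parts = re.split(r"\b(?:for|with|and)\b", sentence)
    return [p.strip() for p in parts if p.strip()]

def classify_sentence(sentence, clf, mlb):
    phrases = split_into_phrases(sentence)
    if not phrases:
        return set()
    indicator = clf.predict(phrases)            # one row of class flags per phrase
    per_phrase = mlb.inverse_transform(indicator)
    # Reporting step: union of all positive phrase-level results.
    return set().union(*per_phrase)
```

Run on the compound example, classify_sentence should return all three classes if the phrase-level classifiers fire correctly, since each phrase is classified on its own and the positives are merged.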

This approach isn't perfect either: it will often fail to properly account for contextual information like negation and pronouns. But it's the next logical step if you're building the system from the ground up.