Solved – the best way to perform sentence segmentation for textual analysis

classification, data preprocessing, machine learning, maximum-entropy, text mining

I am working on a textual dataset containing official documents such as company reports, legal documents, and speeches by directors to shareholders. The content of these documents looks like this:

  1. In respect of its fixed assets:
    a. The Company has maintained proper records showing full particulars including quantitative details and situation of fixed assets.

Your Company owns 99.99 percent of 20 Microns Sdn. Bhd. During the year under review, the said Company reported Gross turnover Rs. 295.74 Lacs

There are a lot of periods that do not represent a sentence boundary. So far I have implemented regular-expression-based sentence segmentation by designing some rules, and it performs reasonably well. But the rules need updating whenever new words containing a period occur in the sentences, which seems like a never-ending task. I have searched, and in some places a Maximum Entropy classifier is recommended for segmentation. I don't understand exactly how to use a MaxEnt classifier for this. How do I implement machine-learning-based sentence segmentation, i.e. based on a maximum entropy classifier or any other method that performs better than regular expressions?

Best Answer

For an ML-based solution you would generally want a sequence learner such as a conditional random field or a recurrent neural network. Each symbol gets a binary label indicating whether it is a sentence boundary or not. Supervised learning on such labelled data yields the classifier you need. You could in theory also try a sliding-window approach with a feed-forward architecture, although I'd expect it to perform quite a bit worse.
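To make the labelling scheme concrete, here is a minimal sketch of how hand-segmented text could be turned into per-token binary labels, plus the kind of sliding-window features a feed-forward or MaxEnt classifier would consume. The function names (`label_tokens`, `window_features`), the whitespace tokenization, and the window size are assumptions chosen for illustration; the two sentences are adapted from the sample in the question.

```python
from typing import Dict, List, Tuple


def label_tokens(sentences: List[str]) -> Tuple[List[str], List[int]]:
    """Flatten hand-segmented sentences into one token stream.

    Each token gets label 1 if it is the last token of a sentence, else 0.
    Tokenization is plain whitespace splitting (an assumption of this sketch).
    """
    tokens: List[str] = []
    labels: List[int] = []
    for sentence in sentences:
        words = sentence.split()
        for pos, word in enumerate(words):
            tokens.append(word)
            labels.append(1 if pos == len(words) - 1 else 0)
    return tokens, labels


def window_features(tokens: List[str], i: int, size: int = 1) -> Dict[str, str]:
    """Sliding-window features: the lowercased tokens around position i."""
    feats: Dict[str, str] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"tok[{offset}]"] = tokens[j].lower() if inside else "<pad>"
    return feats


# Gold segmentation, adapted from the example text in the question.
gold = [
    "Your Company owns 99.99 percent of 20 Microns Sdn. Bhd.",
    "During the year under review, the said Company reported "
    "Gross turnover Rs. 295.74 Lacs",
]

tokens, labels = label_tokens(gold)
examples = [(window_features(tokens, i), y) for i, y in enumerate(labels)]

for tok, y in zip(tokens, labels):
    print(y, tok)
```

A sequence learner would take the whole `tokens`/`labels` pair as one training sequence, while a window-based classifier would be trained on the independent `examples` list.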

If you want an overview of various sentence tokenization methods, have a look at the relevant parts of NLTK: http://www.nltk.org/api/nltk.tokenize.html

Here is an example of how to perform supervised sentence segmentation: http://www.nltk.org/book/ch06.html
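The linked chapter classifies each punctuation token using features such as whether the next word is capitalized, and trains an NLTK classifier on them. As a dependency-free sketch of the same idea, the code below trains a small binary logistic regression (which is what a maximum entropy model reduces to with two classes) using plain SGD. Everything here is an assumption made for illustration: the toy training sentences, the feature names, the function names, and the choice to omit a bias term and regularization. It is not NLTK's implementation, and a real system would need far more labelled data.

```python
import math
from typing import Dict, List, Tuple


def candidate_features(tokens: List[str], i: int) -> List[str]:
    """Binary features for a token that ends in a period."""
    feats = ["word=" + tokens[i].rstrip(".").lower()]
    if i + 1 == len(tokens):
        feats.append("next=end")
    elif tokens[i + 1][0].isupper():
        feats.append("next=capitalized")
    elif tokens[i + 1][0].isdigit():
        feats.append("next=digit")
    else:
        feats.append("next=lowercase")
    return feats


def build_examples(sentences: List[str]) -> List[Tuple[List[str], int]]:
    """Concatenate gold sentences; label each period-final token 1/0."""
    tokens: List[str] = []
    boundaries = set()
    for sentence in sentences:
        tokens.extend(sentence.split())
        boundaries.add(len(tokens) - 1)  # last token of each sentence
    return [
        (candidate_features(tokens, i), int(i in boundaries))
        for i, tok in enumerate(tokens)
        if tok.endswith(".")
    ]


def train(examples, epochs: int = 200, lr: float = 0.5) -> Dict[str, float]:
    """Fit feature weights by SGD on the logistic loss (no bias term)."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for feats, label in examples:
            score = sum(weights.get(f, 0.0) for f in feats)
            prob = 1.0 / (1.0 + math.exp(-score))
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * (label - prob)
    return weights


def segment(text: str, weights: Dict[str, float]) -> List[str]:
    """Split text after every period-final token the model scores above 0."""
    tokens = text.split()
    sentences: List[str] = []
    current: List[str] = []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith("."):
            feats = candidate_features(tokens, i)
            if sum(weights.get(f, 0.0) for f in feats) > 0:
                sentences.append(" ".join(current))
                current = []
    if current:  # flush trailing tokens that had no predicted boundary
        sentences.append(" ".join(current))
    return sentences


# Toy gold data (made up for this sketch, in the style of the question).
train_sentences = [
    "The Company has maintained proper records.",
    "Turnover was Rs. 295.74 Lacs.",
    "It owns 20 Microns Sdn. Bhd. in full.",
    "The directors met in March.",
    "Profit rose to Rs. 12 Lacs.",
]

weights = train(build_examples(train_sentences))
test_text = ("Gross turnover was Rs. 295.74 Lacs. "
             "The Company owns Microns Sdn. Bhd. in part.")
for sentence in segment(test_text, weights):
    print(sentence)
```

The point of the design is that new abbreviations are handled by adding labelled examples and retraining rather than by writing another rule. With a library classifier the `train` function would simply be replaced, while the feature extraction stays the same.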