Solved – How to think of features in NLP problems

feature-engineering, machine learning, natural language, text mining

I am working on a Named Entity Recognition (NER) project. Instead of using an existing library, I decided to implement one from scratch because I want to learn the basics of how PGMs work under the hood. I converted the words in each sentence into feature vectors. The features are manually picked by me, and I can only think of about 20 (such as "Is the token capitalized?", "Is the token an English word?", etc.). However, I've heard that good NER systems represent tokens with far more than 20 features, sometimes hundreds. How do they come up with so many features? Are there any recommended best practices for feature construction?
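For concreteness, here is a minimal sketch of the kind of hand-picked features I mean (the function and feature names are just my own labels):

```python
import re

def basic_token_features(token: str) -> dict:
    """Orthographic features of the sort listed above, one dict per token."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "contains_digit": any(ch.isdigit() for ch in token),
        "contains_hyphen": "-" in token,
        "is_punctuation": bool(re.fullmatch(r"\W+", token)),
        "word_length": len(token),
    }

print(basic_token_features("New-York"))
```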

Best Answer

Indeed, an effective NER system needs a lot of features. If you start from scratch (as I did at first), it's really hard to figure out what features to use beyond the obvious ones you mentioned. What really boosted the scores of the one I built was introducing context: POS-tagging and parsing each sentence, then using the tags of the surrounding words as features. You can also add a vector representation of the word. It also seems important to add word-specific features when you encounter difficult cases (e.g., for "the New-York Times" you can add a feature specifically for it). Finally, you should add big dictionaries (gazetteers) and include dimensions in your feature vector that indicate whether the word belongs to a specific dictionary.
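To make this concrete, here is a sketch of the richer features described above: context windows, POS tags, affixes, and gazetteer membership. All names are illustrative, the gazetteer is a toy placeholder, and the POS tags are assumed to come from whatever tagger you use.

```python
GAZETTEER = {"new york", "london", "paris"}  # in practice: large name lists

def rich_token_features(tokens, pos_tags, i, window=2):
    """Features for tokens[i], given POS tags from any external tagger."""
    token = tokens[i]
    feats = {
        "word.lower": token.lower(),
        "pos": pos_tags[i],
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        "in_gazetteer": token.lower() in GAZETTEER,
        # Bigram lookup catches multiword entries like "New York".
        "bigram_in_gazetteer": " ".join(tokens[i:i + 2]).lower() in GAZETTEER,
    }
    # Context features: surface form and POS of neighbouring tokens,
    # which is what "introducing context" amounts to here.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset:+d}].lower"] = tokens[j].lower()
            feats[f"pos[{offset:+d}]"] = pos_tags[j]
        else:
            feats[f"word[{offset:+d}]"] = "<PAD>"  # sentence boundary
    return feats

tokens = ["He", "visited", "New", "York", "yesterday"]
pos_tags = ["PRP", "VBD", "NNP", "NNP", "NN"]
print(rich_token_features(tokens, pos_tags, 2))
```

With a 2-token window this already yields roughly a dozen features per token, and since most of them are categorical, one-hot encoding them easily produces the hundreds of dimensions you heard about.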

Good luck! Getting a good NER system is a really hard problem, and building features is, most of the time, more a matter of linguistic knowledge than mathematical knowledge!
