Solved – Feature normalization in Text Classification

classification, normalization, r, text-mining

I'm doing text classification in R, and my initial features are just the word frequency inside a document. For example:

docID, label, word1, word2, word3, ...wordN
doc123, 1, 10, 2, 5, ..., 12
doc456, 1, 8, 1, 3, ..., 10
doc789, 0, 2, 10, 4, ..., 4

How should I approach scaling and normalization in this case? For example, if I normalize the frequencies across each row, should I drop the wordN feature (since each row then sums to 1)?

I'm getting better results with this row normalization idea, but the logistic regression output complains about the last column.
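
Here's a minimal sketch of what I mean in R (toy counts copied from the table above; the matrix layout is just for illustration):

    # Toy document-term counts like the table above
    counts <- matrix(c(10,  2,  5, 12,
                        8,  1,  3, 10,
                        2, 10,  4,  4),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("doc123", "doc456", "doc789"),
                                     c("word1", "word2", "word3", "wordN")))

    # Row normalization: divide each row by its own sum
    row_norm <- counts / rowSums(counts)

    rowSums(row_norm)  # every row is exactly 1, so the columns are
                       # linearly dependent and wordN adds no new information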

Thanks for any insights on this!

Best Answer

First of all, your features are not word frequencies; they are just raw counts of each word type in the whole document, I assume. The word frequency (usually called the term frequency) of wordN is the number of occurrences of wordN in the text divided by the text's total word count.
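
As a small R sketch of that definition (the token vector here is invented):

    # Term frequency of each word in one document:
    # occurrences divided by the document's total word count
    tokens <- c("the", "cat", "sat", "on", "the", "mat")

    tf <- table(tokens) / length(tokens)
    tf["the"]  # 2/6, i.e. about 0.33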

Term frequency is usually a good feature for text classification. However, row normalization in your case yields the true word frequencies only if every word type is represented as a feature. Otherwise you are ignoring all the other words, and you even get a singularity when none of the feature words appears in a document.
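
For example, a document containing none of the feature words has a zero row sum, so row normalization divides by zero:

    counts_empty <- c(word1 = 0, word2 = 0, word3 = 0, wordN = 0)
    counts_empty / sum(counts_empty)  # 0/0 gives NaN for every feature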

In real text classification problems we always filter the features, both to exclude typos, articles and other stop words, proper names, etc., and to reduce the model's dimensionality. That's why you should compute the word frequencies as defined above: divide each count by the document's total word count before filtering, not by the row sum of the kept features. The rows then no longer sum to 1, so you don't need to drop wordN, and the collinearity your logistic regression is complaining about disappears. For most algorithms no further normalization is necessary, although specific preprocessing may be recommended for some particular algorithms.
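
Here is a sketch under those assumptions, reusing the toy counts from the question; doc_len stands for each document's total token count before filtering, and its values are made up:

    counts <- matrix(c(10,  2,  5, 12,
                        8,  1,  3, 10,
                        2, 10,  4,  4),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("doc123", "doc456", "doc789"),
                                     c("word1", "word2", "word3", "wordN")))

    # Total token count of each document *before* filtering (made-up values)
    doc_len <- c(doc123 = 60, doc456 = 45, doc789 = 55)

    tf <- counts / doc_len  # recycles doc_len down each column
    rowSums(tf)             # strictly below 1: no redundant column

With features built this way, fitting glm(label ~ ., family = binomial) should no longer report a singularity for the last column.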