NOTE: Before I begin, this F-measure is not related to precision and recall; its name and definition are taken from this paper.

I have a feature known as the F-measure, which measures the formality of a given text. It is mostly used in gender classification of text, which is what my project is about.

The F-measure is defined as:

F = 0.5 * (noun freq. + adjective freq. + preposition freq. + article freq. - pronoun freq. - verb freq. - adverb freq. - interjection freq. + 100)

where the frequencies are taken from a given text (for example, a blog post).
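As a minimal sketch of the definition above, the score can be computed from a dictionary of per-category frequencies. The function name `f_measure`, the dictionary keys, and the example numbers are illustrative assumptions, not taken from the paper.

```python
def f_measure(freq):
    """Formality score F from per-category frequencies.

    `freq` is a hypothetical dict mapping a part-of-speech name to its
    frequency in the text; missing categories count as 0.
    """
    positive = ("noun", "adjective", "preposition", "article")
    negative = ("pronoun", "verb", "adverb", "interjection")
    plus = sum(freq.get(k, 0) for k in positive)
    minus = sum(freq.get(k, 0) for k in negative)
    return 0.5 * (plus - minus + 100)


# Made-up frequencies for a single blog post (illustration only).
example = {
    "noun": 30, "adjective": 10, "preposition": 12, "article": 8,
    "pronoun": 9, "verb": 18, "adverb": 6, "interjection": 1,
}
score = f_measure(example)
print(score)
```

Here the positive categories sum to 60 and the negative ones to 34, so F = 0.5 * (60 - 34 + 100) = 63.0.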

I would like to normalize this feature for use in a classification task. My first thought was that, since F is bounded by the number of words in the text (text_length), I could divide F by text_length. Then, since the measure can take both positive and negative values (as can be inferred from the equation), I thought of squaring (F/text_length) to get only positive values.

When I tried this, the normalised values did not look right: I got very small values (below 0.10) for every case I tested. I suspect the reason is the squaring, which makes the value smaller because it is the square of a fraction. However, squaring is required if I want to guarantee positive values only. I am not sure what else to consider so that the normalisation produces a well-spread distribution within [0,1], and I would like to know whether there is a general strategy for correctly normalising NLP features.
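To make the shrinking effect concrete, here is a small sketch with made-up values of F/text_length (the numbers are assumptions chosen for illustration, not my real data). Squaring any value in (-1, 1) moves it towards 0, and it also throws away the sign.

```python
# Hypothetical values of F / text_length, all strictly between -1 and 1.
ratios = [-0.5, -0.2, 0.1, 0.3]

# Squaring guarantees non-negative values, but shrinks every magnitude.
squared = [r * r for r in ratios]

for r, s in zip(ratios, squared):
    print(f"{r:+.2f} -> {s:.4f}")

# The sign is lost: -0.3 and +0.3 map to the same squared value.
same = (-0.3) ** 2 == 0.3 ** 2
```

Every squared value is no larger in magnitude than the ratio it came from, which matches the very small values I am seeing.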

How should I approach the normalisation of my feature, and what might I be doing wrong?

Best Answer

You can try ordinary feature scaling if you are doing model training or cross-validation, i.e. if you have all the instances available in a batch, so that the min, max, mean, and standard deviation can be computed.

For example, as Wikipedia says: 'Determine the distribution mean and standard deviation for each feature. Next subtract the mean from each feature. Then divide the values (mean is already subtracted) of each feature by its standard deviation.'
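A rough sketch of both kinds of scaling, using only the Python standard library, might look like this. The function names and the sample F values are assumptions for illustration; the statistics should be computed on the training batch and then reused for new instances.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the batch
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-max scaling: map the batch linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical F values for four texts.
f_scores = [63.0, 48.5, 55.0, 71.5]
z_scores = standardize(f_scores)
scaled = min_max_scale(f_scores)
print(z_scores)
print(scaled)
```

Neither approach needs squaring: min-max scaling already yields values in [0, 1] even when F is negative, and standardization keeps the sign while centering the feature at 0.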

You can also try classification without normalization: many ML algorithms (or their implementations) can deal with that, e.g. Weka includes a normalization step by default.