Solved – Normalizing Term Frequency for document clustering

clustering, data mining, frequency

I have a problem understanding the normalization of the term frequency weight in the document vector space model for clustering. Let's say that for a document d I have counted the occurrences of all terms. I remember reading somewhere, some time ago, that they should be divided by the maximum term frequency for that specific document, or by the sum of all frequencies, but I cannot find the source. Is that correct? If so, I would really appreciate a source for this information, since I need it for my thesis.
My second question is about calculating the TF-IDF weight: in this case, should the TF be normalized somehow, or should I use the raw frequencies?

Best Answer

A common misunderstanding concerns the term "frequency". To some, it seems to mean a raw count of objects, but a frequency is usually a relative value.

TF-IDF is usually a two-fold normalization.

First, each document is normalized to length 1, so there is no bias towards longer or shorter documents. This amounts to taking the relative frequencies instead of the absolute term counts. That is the "TF" part.
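As a concrete illustration (a minimal sketch only; the function name `term_frequencies` and the toy document are made up for this answer), turning raw counts into relative frequencies might look like this:

```python
from collections import Counter

def term_frequencies(tokens):
    """Turn a document's raw term counts into relative frequencies,
    so that documents of different lengths become comparable."""
    counts = Counter(tokens)
    total = sum(counts.values())  # document length in tokens
    return {term: count / total for term, count in counts.items()}

# Example: a short "document" as a list of tokens
doc = ["data", "mining", "is", "fun", "data"]
print(term_frequencies(doc))
# {'data': 0.4, 'mining': 0.2, 'is': 0.2, 'fun': 0.2}
```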

Second, IDF is a cross-document normalization that puts less weight on common terms and more weight on rare terms, by weighting each term with its inverse in-corpus frequency. Here it does not matter whether you use the absolute or the relative frequency, since that amounts to a constant factor (the corpus size) across all vectors; the distances change, but only by that constant factor.
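Building on the sketch above (again with made-up names and a toy corpus, and using the common log(N / document frequency) form of IDF as an assumption, since the answer does not fix a particular formula), combining the two steps might look like this:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF vectors for a list of tokenized documents.
    TF  = relative frequency of a term within its document.
    IDF = log(N / number of documents containing the term)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

corpus = [["data", "mining", "data"], ["data", "clustering"], ["text", "mining"]]
for vec in tf_idf(corpus):
    print(vec)
```

Note that a term occurring in every document gets an IDF of log(1) = 0 in this variant, which is exactly the "less weight on common terms" effect described above.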

To get the formulas right, try to understand why they are supposed to be one way or another. It's worthless to just copy some formula from a source that may even have it wrong. Instead, understand the mathematics and intentions behind it.