Solved – Comparing two word distributions

chi-squared-test, distributions, kolmogorov-smirnov test, kullback-leibler, text mining

I want to compare two empirical distributions of words in two different texts to see if they are reasonably similar. So for each text I perform the usual steps like stopword removal and stemming, and then count word occurrences in both texts to obtain the two discrete distributions.

To get the same region for both distributions, I take only the intersection of the two underlying regions (word sets) and discard the rest.
The problem I'm facing now is which method to choose for measuring similarity or distance between my two discrete distributions X and Y.
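For concreteness, here is a minimal sketch of that setup in Python; the token lists are toy placeholders, and stopword removal and stemming are assumed to have happened already:

```python
from collections import Counter

tokens_a = ["cat", "dog", "dog", "fish"]    # toy stand-in for text A (already stemmed, stopwords removed)
tokens_b = ["dog", "fish", "fish", "bird"]  # toy stand-in for text B

counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)

# restrict both distributions to the intersection of their supports
common = counts_a.keys() & counts_b.keys()

def normalize(counts, support):
    total = sum(counts[w] for w in support)
    return {w: counts[w] / total for w in support}

p = normalize(counts_a, common)  # empirical distribution of text A on the common region
q = normalize(counts_b, common)  # empirical distribution of text B on the common region
```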

From my Data Analysis classes I remember the Kolmogorov–Smirnov test, which doesn't seem to be applicable, since my data is discrete (integers!). However, it can apparently still be used if overly conservative results are acceptable.

To my knowledge, the $\chi^2$ test is not applicable either, since I'm not interested in the dependency of X and Y. I can safely assume that a word's frequency in one text doesn't depend on the word's frequency in the other text.

What's keeping me from simply using the Kullback–Leibler divergence is that it is not symmetric, and I cannot designate either text's distribution as the model or the data distribution. However, I could perform the KL calculation twice, for $KL(X,Y)$ and $KL(Y,X)$, and take the larger/smaller value if I want a more/less conservative similarity estimate. It's not perfect, for the same reasons a single pass isn't.
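As an illustration of the two-way KL idea, here is a short sketch, assuming the `p` and `q` dictionaries from the snippet above (strictly positive on a shared support, since the KL divergence is undefined when the second distribution assigns zero probability):

```python
from math import log

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for dicts over the same support."""
    return sum(p[w] * log(p[w] / q[w]) for w in p)

kl_pq, kl_qp = kl(p, q), kl(q, p)
conservative = max(kl_pq, kl_qp)  # larger divergence -> more conservative similarity estimate
lenient = min(kl_pq, kl_qp)       # smaller divergence -> less conservative estimate
```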

Then there's the Bhattacharyya distance, which seems to do the job just fine and is free of assumptions regarding the nature of the data. It also seems to be pretty straightforward to implement.
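For reference, a sketch of the Bhattacharyya distance on the same dict representation (nothing beyond the standard definition is assumed):

```python
from math import sqrt, log

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance for two discrete distributions on the same support."""
    bc = sum(sqrt(p[w] * q[w]) for w in p)  # Bhattacharyya coefficient, in (0, 1]
    return -log(bc)                         # 0 iff the distributions are identical

d = bhattacharyya_distance(p, q)
```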

I'm a computer science grad student, and my statistical background is sadly not as strong as I would like it to be right now. Are there any good reasons for not proceeding with the Bhattacharyya distance? Or is there another good candidate for the task?

Thank you.

Edit: As was pointed out, my first approach to getting the same region for both distributions is heavily biased. A nice example of this would be two distributions with only one common word (stem): this would result in 100% similarity, which is obviously very wrong.

A better idea would be to discard every word stem that occurs fewer than $threshold$ times and to use the union of the two remaining word sets as the common region for the two distributions.
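A sketch of that revised construction, reusing the `counts_a`/`counts_b` counters from the first snippet; the threshold value is purely illustrative:

```python
threshold = 2  # illustrative cutoff

def frequent(counts, threshold):
    return {w for w, c in counts.items() if c >= threshold}

support = frequent(counts_a, threshold) | frequent(counts_b, threshold)

def normalize_on(counts, support):
    total = sum(counts[w] for w in support)  # Counter returns 0 for missing stems
    return {w: counts[w] / total for w in support}

# Stems missing from one text get probability 0 on the union support; that is fine
# for the Bhattacharyya distance but requires smoothing before computing KL.
p = normalize_on(counts_a, support)
q = normalize_on(counts_b, support)
```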

Best Answer

You could use a customization of the Min-Hash algorithm. The problem then reduces to comparing hash values of your distributions by their Hamming distance.


  1. You start by seeding your pseudorandom number generator (PRNG) with $c$.
  2. You draw $n$ samples from the first distribution $p_A$.
  3. You seed your PRNG again with $c$.
  4. You sample $n$ words from your other distribution $q_A$.
  5. You compare both samples by means of the Hamming distance (e.g., 500 out of 700 samples agree → similarity ≈ 0.71; see the sketch below).
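A minimal sketch of these five steps, assuming the dict-based distributions `p` and `q` from above; the seed `c`, the sample size `n`, and the use of Python's `random.choices` as the sampler are my own choices, not part of the original answer:

```python
import random

def draw_samples(dist, n, seed):
    rng = random.Random(seed)                # steps 1 and 3: seed the PRNG with c
    words = sorted(dist)                     # fixed word order so both calls line up
    weights = [dist[w] for w in words]
    return rng.choices(words, weights, k=n)  # steps 2 and 4: draw n weighted samples

def sampled_similarity(p, q, n=700, c=1234):
    sample_p = draw_samples(p, n, c)         # same seed => same underlying random stream
    sample_q = draw_samples(q, n, c)
    agree = sum(a == b for a, b in zip(sample_p, sample_q))  # step 5: positionwise comparison
    return agree / n

print(sampled_similarity(p, q))
```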

If all words $a$ from your distribution $p_A$ had the same probability $p_A(a) = \frac{1}{|A|}$, the process would reduce exactly to Min-Hash, and Min-Hash is an approximation to the Jaccard similarity of both word sets. By enlarging $n$ you approach the exact value of the Jaccard index as $n \rightarrow \infty$.

With this scheme you can assign arbitrary probabilities to your words and nonetheless obtain a similarity score.


Anyhow, the process is not that simple, because typically Vose's alias method is used to draw samples from discrete distributions. Then the drawn samples are not independent of each other (because a draw can consume a variable number of random numbers depending on $p_A(x)$), which is bad when comparing the sample sequences by Hamming distance later.

Hence, you are required to reseed after every drawn sample with a number $c+i$, where $i$ runs up to $n$.
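A sketch of that reseeding scheme, with the same assumed names as above; here the $i$-th draw from each distribution gets its own seed $c+i$ (with $i$ running from $0$ to $n-1$), so a sampler that consumes a variable number of random numbers can no longer knock the two streams out of alignment:

```python
import random

def draw_samples_reseeded(dist, n, c):
    words = sorted(dist)
    weights = [dist[w] for w in words]
    samples = []
    for i in range(n):
        rng = random.Random(c + i)                     # fresh seed c + i for the i-th draw
        samples.append(rng.choices(words, weights)[0]) # one weighted sample per seed
    return samples

def sampled_similarity_reseeded(p, q, n=700, c=1234):
    a = draw_samples_reseeded(p, n, c)
    b = draw_samples_reseeded(q, n, c)
    return sum(x == y for x, y in zip(a, b)) / n       # Hamming agreement as similarity
```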