Solved – How to compare or normalize two sets of probabilities

Tags: probability, probability-inequalities

I'm not sure how to ask this, so I've written two versions: the short version is a pure stats question, and the longer version explains my application.

The short version:

I have two sets of probabilities. The first set ranges from $5.8\times 10^{-3}$ down to $2.52\times 10^{-9}$ and the second from $3.51\times 10^{-4}$ down to $1.59\times 10^{-6}$. Both sets have a quasi-logarithmic distribution. The sets are not the same length. They comprise similar data but have different ranges due to the different sample sizes.

I need to normalize the probabilities somehow so that I can compare my input data against both. How do I do this? (remember I am a stats newbie!)

If you need more info, please read the longer version:

The longer version:

I am writing a software application that tries to predict the word a user is typing before he/she is finished. I have two source "dictionaries":

  1. The first is a list of individual words ("singlets") + a count (the number of times the word appears in a corpus). It contains around 65,000 unique English words.
  2. The second is a list of word triplets (e.g. "one of the") and the count of each triplet in a similar corpus. This list contains the most common 100,000 English triplets. Although it has more entries than the first list, it has far fewer unique words.

The probability of a dictionary entry is: $\frac{\mathrm{count}}{\sum \mathrm{count}}$
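As a quick illustration of that formula, here is a minimal Python sketch using a hypothetical count dictionary (the words and counts below are made up, not from either corpus):

```python
# Hypothetical word counts, standing in for one of the dictionaries.
counts = {"the": 500, "to": 350, "tomorrow's": 3}

# Probability of each entry: count / sum of all counts.
total = sum(counts.values())
probs = {word: c / total for word, c in counts.items()}
```

By construction, the resulting probabilities sum to 1 within each dictionary, which is exactly why the two dictionaries end up on different scales: each is normalized by its own total.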

Both lists have quasi-logarithmic probability distributions, as you can imagine.

I must predict words as the user is typing. For example, if the user just typed "one of t", my goal is to predict what the third word might be. The reason I use both lists is that the triplet list provides more accuracy, but the singlet list has more unique words. The user might be typing a common triplet like "one of the" or an uncommon one like "one of tomorrow's". As a result, I need to combine results from both lists.

My result set is the most common singlets that start with "t" and the most common triplets that start with "one of t", sorted by probability. The problem is the probability ranges are very different (see above) due to the different sizes of the corpora and the nature of singlets vs. triplets, so my results are usually skewed toward one list or the other. I don't fully understand the mathematics behind this, but the bottom line is that the predictions are screwy.

Best Answer

Great problem!!

And the long + short version is a very good way of describing the problem!! (+1)

I would use conditional probabilities, which may require some computation on the fly. Following up on your example, you would normalize the triplets by the total probability of "one of t", i.e. the sum of the probabilities of all the triplets that start with "one of t". Do the same for the singlets, i.e. divide by the sum of the probabilities of all the singlets that start with "t". This should scale your probabilities nicely and in a meaningful way.

So, to sum up, the updated probability of a word starting with $abc$ (say) is:

$$ P^*(word|abc) = \frac{P(word)}{ \sum_{w \in A} P(w) },$$

where $A$ is the set of words that start with the letters $abc$.
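A minimal sketch of this normalization in Python, using hypothetical probability tables (the entries below are invented for illustration; in the real application they would come from the two dictionaries):

```python
# Toy probability tables, standing in for the singlet and triplet lists.
singlet_probs = {"the": 5.8e-3, "to": 4.1e-3, "tomorrow's": 2.0e-6}
triplet_probs = {"one of the": 3.51e-4, "one of them": 8.0e-5,
                 "one of those": 5.0e-5}

def conditional(probs, prefix):
    """Renormalize the entries that start with `prefix` so they sum to 1.

    This implements P*(word | prefix) = P(word) / sum of P(w) over all
    w in the table that start with `prefix`.
    """
    matches = {k: p for k, p in probs.items() if k.startswith(prefix)}
    total = sum(matches.values())
    return {k: p / total for k, p in matches.items()}

# Singlets conditioned on the typed letter "t":
p_singlet = conditional(singlet_probs, "t")
# Triplets conditioned on the typed prefix "one of t":
p_triplet = conditional(triplet_probs, "one of t")
```

After this step both result sets are conditional distributions that each sum to 1, so candidates from the two lists live on a comparable scale and can be merged and sorted together.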
