Log-likelihood ratio in document summarization

natural-language, text-summarization

I initially asked this on stack overflow and was referred to this site, so here goes:

I am implementing some unsupervised methods of content-selection/extraction-based document summarization, and I'm confused about what my textbook calls the "log-likelihood ratio". The book Speech and Language Processing by Jurafsky & Martin briefly describes it as follows:

The LLR for a word, generally called lambda(w), is the ratio between the probability of observing w in both the input and in the background corpus assuming equal probabilities in both corpora, and the probability of observing w in both assuming different probabilities for w in the input and the background corpus.

Breaking that down, we have the numerator: "the probability of observing w in both the input and in the background corpus assuming equal probabilities in both corpora" – how do I calculate what probability to use here?

and the denominator: "the probability of observing w in both assuming different probabilities for w in the input and the background corpus" – is this as simple as the probability of the word occurring in the input times the probability of the word occurring in the corpus? For example:

(count(word,input) / total words in input) * (count(word,corpus) / total words in corpus)

I've been looking over a paper my book references, Accurate Methods for the Statistics of Surprise and Coincidence (Dunning 1993), but I'm finding it difficult to relate it to the problem of calculating LLR values for individual words in extraction-based summarization. Any clarification here would be really appreciated.

Best Answer

With my limited knowledge, I think:

  1. "the probability of observing w in input" requires a distribution in order to compute the value
  2. "the probability of observing w in both the input and in the background corpus assuming equal probabilities in both corpora" means "the likelihood of observing w ... given that the probability for w is equal in both corpora".

Here's my formulation. Stating the problem a little more formally:

  1. Hypothesis 1 (the null): $P(w \text{ in input}) = P(w \text{ in background}) = p$
  2. Hypothesis 2: $P(w \text{ in input}) = p_1$ and $P(w \text{ in background}) = p_2$, with $p_1 \ne p_2$

The critical part is that you need to assume a distribution here. Simplistically, we assume a binomial distribution for generating occurrences of w in a text. Given the sample data, we can use maximum likelihood estimation to compute the values of $p$, $p_1$, and $p_2$, and here they are:

  1. $p = \frac{\text{count of } w \text{ in input} + \text{count of } w \text{ in background}}{\text{input size} + \text{background size}} = \frac{c_1 + c_2}{N_1 + N_2}$
  2. $p_1 = c_1 / N_1$
  3. $p_2 = c_2 / N_2$
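As a minimal sketch, here are those three estimates in Python (the variable names c1, N1, c2, N2 mirror the notation above; the function name is my own):

    def mle_estimates(c1, N1, c2, N2):
        """Maximum likelihood estimates under the two hypotheses.

        c1, N1: occurrences of w in the input and the input's total word count
        c2, N2: occurrences of w in the background corpus and its total size
        """
        p = (c1 + c2) / (N1 + N2)  # Hypothesis 1: one shared probability
        p1 = c1 / N1               # Hypothesis 2: input probability
        p2 = c2 / N2               # Hypothesis 2: background probability
        return p, p1, p2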

We want to know which hypothesis is more likely, so we compute the likelihood of each and compare them (which is essentially what a likelihood ratio does).

Since we assume a binomial distribution, we can compute the likelihood of observing the counts $c_1$ and $c_2$.

For Hypothesis 1:

$L(c_1)$ = the probability of observing w in the input = the likelihood of getting the count $c_1$ out of $N_1$ words, assuming probability $p$ (or, in other words, of selecting w exactly $c_1$ times out of $N_1$ draws). That is the binomial probability $b(N_1, c_1, p) = \binom{N_1}{c_1} p^{c_1} (1-p)^{N_1 - c_1}$.

$L(c_2)$ = the probability of observing w in the background = the likelihood of getting the count $c_2$ out of $N_2$ words, assuming probability $p$, which is $b(N_2, c_2, p)$.

For Hypothesis 2, we use $p_1$ and $p_2$ in place of $p$: $L(c_1) = b(N_1, c_1, p_1)$ and $L(c_2) = b(N_2, c_2, p_2)$.
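As a sketch of this step, the four binomial likelihoods can be computed with scipy.stats.binom.pmf (the helper name hypothesis_likelihoods is my own):

    from scipy.stats import binom

    def hypothesis_likelihoods(c1, N1, c2, N2, p, p1, p2):
        # Hypothesis 1: both corpora share the same probability p
        L1_input = binom.pmf(c1, N1, p)
        L1_background = binom.pmf(c2, N2, p)
        # Hypothesis 2: each corpus has its own probability
        L2_input = binom.pmf(c1, N1, p1)
        L2_background = binom.pmf(c2, N2, p2)
        return (L1_input, L1_background), (L2_input, L2_background)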

Now we want to know which hypothesis is more likely; to do that, we need a single output value from each hypothesis that we can compare.

But each hypothesis has two values, $L(c_1)$ and $L(c_2)$. How can we compare which hypothesis is more likely? We multiply them together to get a single value per hypothesis: since the input count and the background count are treated as independent observations, the product $L(c_1) \cdot L(c_2)$ is exactly the joint likelihood of the data under that hypothesis.
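Putting it together, the likelihood ratio for a word is

$\lambda(w) = \dfrac{b(N_1, c_1, p)\, b(N_2, c_2, p)}{b(N_1, c_1, p_1)\, b(N_2, c_2, p_2)}$

and Dunning (1993) works with $-2 \log \lambda$, which is asymptotically $\chi^2$-distributed with one degree of freedom, so large values indicate that the equal-probability hypothesis fits poorly, i.e. that w behaves very differently in the input than in the background. Here is a minimal end-to-end sketch in Python (a sketch under the binomial assumption above, not the paper's exact derivation; logpmf is used to avoid underflow, and the function name is my own):

    from scipy.stats import binom

    def neg2_log_lambda(c1, N1, c2, N2):
        """-2 log lambda for one word (Dunning-style log-likelihood ratio)."""
        p = (c1 + c2) / (N1 + N2)
        p1, p2 = c1 / N1, c2 / N2
        # Joint log-likelihood of (c1, c2) under each hypothesis.
        logL_h1 = binom.logpmf(c1, N1, p) + binom.logpmf(c2, N2, p)
        logL_h2 = binom.logpmf(c1, N1, p1) + binom.logpmf(c2, N2, p2)
        return -2 * (logL_h1 - logL_h2)

    # Hypothetical example: a word seen 50 times in a 1,000-word input
    # but only 10 times in a 100,000-word background corpus.
    print(neg2_log_lambda(50, 1000, 10, 100000))

If I remember correctly, the summarization work Jurafsky & Martin draw on (Lin & Hovy's topic signatures) treats words whose $-2 \log \lambda$ exceeds roughly 10 as salient, which is how these per-word values feed back into content selection.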
