Solved – LLR with Positive and Negative Values vs. Dunning method with Entropy-based Calculation

chi-squared-test, entropy, probability

Ted Dunning has a blog post about calculating G2 (aka LLR) using Entropy calculations as components. I found this really intriguing.

Ted's original post:
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

And yesterday he was nice enough to help fix my Excel (and subsequent Java) implementation, correcting for +/- signs in the final formula. He even produced a corrected Excel sheet.

Discussion with Ted about his formula and his answer to my post:
https://math.stackexchange.com/a/693178/35616

But I think there's still a separate issue (or a misunderstanding on my part). Most LLR papers talk about negative and positive values: terms more prominent in corpus A versus terms more prominent in corpus B should have opposite signs, so you know which direction the change goes. A couple of examples of this discussion: Can the log likelihood ratio for a simple vs simple hypothesis take a negative value? and http://scg.unibe.ch/archive/papers/Kuhn09aLogLikelihoodRatio.pdf

Chi-squared is often compared to G2 (LLR). Chi-squared is always positive, since it's squared, but apparently the "G2" test doesn't hold to that; the "2" is perhaps a bit misleading.

Looking at Dunning's calculations, I don't see how the sign can ever flip?

His contingency tables:

              Corpus A  Corpus B   Row Totals
Target Word     k_11     k_12      totalRow1
Other Words     k_21     k_22      totalRow2
Column totals   col1     col2      grandTotal

You then calculate Entropy for row totals, column totals, and the overall 4 k cells:

  • H_rowTotals
  • H_colTotals
  • H_k

These are later combined in the final formula, along with grandTotal.
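As an aside, here's a minimal Python sketch of the entropy decomposition as I understand it from Ted's post (the function names are mine, and I'm using the "unnormalized" entropy trick, computing from raw counts via x·ln(x), so the grandTotal factor is folded in rather than appearing explicitly):

```python
import math

def xlogx(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized entropy of a list of counts: N * H(p), in nats."""
    total = sum(counts)
    return xlogx(total) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G2 = 2 * N * (H_rowTotals + H_colTotals - H_k)."""
    h_rows = entropy(k11 + k12, k21 + k22)   # N * H_rowTotals
    h_cols = entropy(k11 + k21, k12 + k22)   # N * H_colTotals
    h_k    = entropy(k11, k12, k21, k22)     # N * H_k
    return 2.0 * (h_rows + h_cols - h_k)
```

Because entropy() already carries the factor of N, the final line needs no separate grandTotal multiplication.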

BUT the signs don't change when a word moves from column 1 to column 2:

  • For row 1, if you transpose k_11 and k_12 (and assuming row 2 is unchanged), then row 1's total stays the same, and therefore H_rowTotals doesn't change sign.
  • Transposing k_11 and k_12 does change the order of the column totals, BUT the calculation of H_colTotals isn't impacted by order, so the sign doesn't change.
  • And then when calculating H_k, it also doesn't care what order the k cells are in, so it also doesn't change sign.
  • And of course the grand total is always >= 0.

So transposing k_11 and k_12 can't change the sign of H_rowTotals, H_colTotals, H_k, or grandTotal, and those are the only four variable inputs to the final formula, so the final formula can't change sign.

Ted was nice enough to upload a revised Excel sheet, but it actually demonstrates this lack of a sign change:
https://dl.dropboxusercontent.com/u/36863361/entropy-and-LLR-suspect-gist.xlsx

But whether you do:

              Corpus A  Corpus B
Target Word      10        0
Other Words       0       10

Or:

              Corpus A  Corpus B
Target Word       0       10
Other Words      10        0

You still get: +27.7
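A quick numeric check confirms this. The sketch below assumes the unnormalized-entropy formulation of the formula (helper names are mine): swapping the two columns leaves the value, and its sign, unchanged.

```python
import math

def xlogx(x):
    # x * ln(x), with 0 * ln(0) taken as 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # unnormalized entropy: N * H(p), in nats
    total = sum(counts)
    return xlogx(total) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # 2 * N * (H_rowTotals + H_colTotals - H_k), with N folded into entropy()
    return 2.0 * (entropy(k11 + k12, k21 + k22)
                  + entropy(k11 + k21, k12 + k22)
                  - entropy(k11, k12, k21, k22))

print(llr(10, 0, 0, 10))  # 27.725887...
print(llr(0, 10, 10, 0))  # identical: 27.725887...
```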

The only theory I can think of is that maybe there's some alternative, order-dependent definition of entropy, or something that approximates one.

Another theory I discarded was that maybe you flip the sign of one of the columns, but that won't work, since it would produce negative numbers in the probability calculation, which would then be invalid as inputs to the log function.

To be clear, I'm super grateful for the previous help, but this sign change keeps bugging me. Also, I'm posting this on the Stats Stack Exchange site, rather than the general Math site, since I think the question is more stats-specific.

Best Answer

I may have an answer, borrowed from a non-entropy form of the calculation.

Reviewing http://scg.unibe.ch/archive/papers/Kuhn09aLogLikelihoodRatio.pdf (end of page 1, start of page 2), they mention:

"By multiplying ... with the signum of p2 − p1 we can further distinguish between terms specific to the first corpus and ... the second"

Signum is just a fancy way of asking whether the result is greater than, equal to, or less than zero; it returns +1, 0, or −1 accordingly.

Revisiting the original Contingency Table:

                Corpus A   Corpus B
Target Word       k_11       k_12
Other Words       k_21       k_22
Column totals   col1Total  col2Total

Calculating p1 and p2:

  • p1 = k_11 / col1Total
  • p2 = k_12 / col2Total

I believe signum( p2 − p1 ) is just a fancy way of saying if p2 < p1 then multiply the answer by -1.0.

If a term is used 20% of the time in corpus A and only 10% of the time in corpus B, I believe the number should be positive. If its % use is higher in B than in A, then the number should be negative.

Staring at this, it seems like signum(p2 − p1) gives the opposite of that... but the Adrian Kuhn paper shows the equation in the form "−2 log λ", so maybe that flips it from what you start with in the Dunning model....

Or I'm otherwise confused about the meaning of +/-.

From http://ucrel.lancs.ac.uk/llwizard.html

  • Positive = more prominent in A, "+ indicates overuse in A relative to B"
  • Negative = more prominent in B, "- indicates underuse in A relative to B"
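Putting the pieces together, here's a sketch of a signed LLR following the Lancaster convention above (positive = overuse in corpus A). The helper names, and the choice of sign(p1 − p2) rather than the Kuhn paper's sgn(p2 − p1), are my own reading, not something from Ted's post:

```python
import math

def xlogx(x):
    # x * ln(x), with 0 * ln(0) taken as 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # unnormalized entropy: N * H(p), in nats
    total = sum(counts)
    return xlogx(total) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # always non-negative, per the symmetry argument in the question
    return 2.0 * (entropy(k11 + k12, k21 + k22)
                  + entropy(k11 + k21, k12 + k22)
                  - entropy(k11, k12, k21, k22))

def signed_llr(k11, k12, k21, k22):
    """Positive if the target word is overused in corpus A, negative if in B."""
    p1 = k11 / (k11 + k21)   # rate of the target word in corpus A
    p2 = k12 / (k12 + k22)   # rate of the target word in corpus B
    sign = 1.0 if p1 > p2 else (-1.0 if p1 < p2 else 0.0)
    return sign * llr(k11, k12, k21, k22)

print(signed_llr(10, 0, 0, 10))  # +27.7... (overused in A)
print(signed_llr(0, 10, 10, 0))  # -27.7... (overused in B)
```

The magnitude is the ordinary (unsigned) G2 either way; only the factor of ±1 encodes the direction.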

Well a bit of progress at least:

  • I have the sign changing between +/-, which is some progress.
  • Now I just need to confirm which direction means what. ;-)