I may have an answer, borrowed from a non-entropy form of the calculation.
Reviewing http://scg.unibe.ch/archive/papers/Kuhn09aLogLikelihoodRatio.pdf (end of page 1, start of page 2), they mention:
"By multiplying ... with the signum of p2 − p1 we can further
distinguish between terms specific to the first corpus and ... the
second"
Signum is just a fancy way of asking "is the result greater than, less than, or equal to zero": it returns +1, -1, or 0 accordingly.
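A minimal sketch of that in Python (the helper name `signum` is just my own):

```python
def signum(x):
    """Return +1 if x > 0, -1 if x < 0, and 0 if x == 0."""
    return (x > 0) - (x < 0)
```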
Revisiting the original Contingency Table:
|               | Corpus A  | Corpus B  |
|---------------|-----------|-----------|
| Target Word   | k_11      | k_12      |
| Other Words   | k_21      | k_22      |
| Column totals | col1Total | col2Total |
Calculating p1 and p2:
- p1 = k_11 / col1Total
- p2 = k_12 / col2Total
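To make that concrete, here is a small sketch with made-up counts (two 1,000-word corpora, chosen to match the 20% / 10% example below):

```python
# Hypothetical counts for one target word in two 1,000-word corpora.
k_11, k_21 = 200, 800   # Corpus A: target word, other words
k_12, k_22 = 100, 900   # Corpus B: target word, other words

col1Total = k_11 + k_21
col2Total = k_12 + k_22

p1 = k_11 / col1Total   # 0.20 -> relative frequency in corpus A
p2 = k_12 / col2Total   # 0.10 -> relative frequency in corpus B

sign = (p2 > p1) - (p2 < p1)   # signum(p2 - p1): -1 here, since p2 < p1
```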
I believe multiplying by signum( p2 − p1 ) is just a fancy way of saying: if p2 < p1, multiply the answer by -1.0.
If a term is used 20% of the time in corpus A and only 10% of the time in corpus B, I believe the number should be positive. If its usage is higher in B than in A, then the number should be negative.
Staring at this, it seems like signum(p2 − p1) gives the opposite of that... but the Adrian Kuhn paper shows the equation in the form "−2 log λ", so maybe that minus sign flips it relative to what you start with in the Dunning model...
Or I'm otherwise confused about the meaning of +/-.
From http://ucrel.lancs.ac.uk/llwizard.html
- Positive = more prominent in A, "+ indicates overuse in A relative to B"
- Negative = more prominent in B, "- indicates underuse in A relative to B"
Well, a bit of progress at least:
- I have the sign changing between + and -.
- Now I just need to confirm which direction means what. ;-)
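Putting the pieces together, here is a sketch of how I currently read it: Dunning's log-likelihood ratio (G²) computed from the 2×2 table, then signed so that positive means overuse in corpus A, i.e. the Lancaster convention above. Note this uses sign(p1 − p2), the negation of the paper's literal signum(p2 − p1), so treat the direction as an assumption to verify; `signed_llr` is just my own helper name.

```python
import math

def signed_llr(k_11, k_12, k_21, k_22):
    """Signed log-likelihood ratio (Dunning's G2) for a 2x2 contingency table.

    k_11/k_12: target-word counts in corpus A / corpus B
    k_21/k_22: other-word counts in corpus A / corpus B
    Sign convention (assumed, per the Lancaster wizard):
    positive = overuse in A relative to B, negative = underuse in A.
    """
    observed = [k_11, k_12, k_21, k_22]
    row1, row2 = k_11 + k_12, k_21 + k_22
    col1, col2 = k_11 + k_21, k_12 + k_22
    total = row1 + row2
    expected = [row1 * col1 / total, row1 * col2 / total,
                row2 * col1 / total, row2 * col2 / total]

    g2 = 2.0 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

    p1, p2 = k_11 / col1, k_12 / col2
    return g2 if p1 > p2 else -g2 if p1 < p2 else 0.0

print(signed_llr(200, 100, 800, 900))   # 20% in A vs 10% in B -> positive
print(signed_llr(100, 200, 900, 800))   # 10% in A vs 20% in B -> negative
```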
I will use the same notation I used here: Mathematics behind classification and regression trees
Gini Gain and Information Gain ($IG$) are both impurity-based splitting criteria. The only difference is in the impurity function $I$:
- $\textit{Gini}: \mathit{Gini}(E) = 1 - \sum_{j=1}^{c}p_j^2$
- $\textit{Entropy}: H(E) = -\sum_{j=1}^{c}p_j\log p_j$
They are actually particular cases of a more general entropy measure (Tsallis entropy), parametrized by $\beta$:
$$H_\beta (E) = \frac{1}{\beta-1} \left( 1 - \sum_{j=1}^{c}p_j^\beta \right)$$
$\textit{Gini}$ is obtained with $\beta = 2$ and $H$ with $\beta \rightarrow 1$.
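A quick numeric check of that claim in Python (the three-class distribution and the helper names are made up for illustration):

```python
import math

def tsallis(p, beta):
    """Tsallis entropy: (1 - sum_j p_j^beta) / (beta - 1)."""
    return (1.0 - sum(pj ** beta for pj in p)) / (beta - 1.0)

def gini(p):
    """Gini impurity: 1 - sum_j p_j^2."""
    return 1.0 - sum(pj ** 2 for pj in p)

def shannon(p):
    """Shannon entropy (natural log): -sum_j p_j * log(p_j)."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

p = [0.5, 0.3, 0.2]                        # made-up class distribution
print(gini(p), tsallis(p, 2.0))            # identical: beta = 2 gives Gini
print(shannon(p), tsallis(p, 1.0 + 1e-6))  # beta -> 1 recovers Shannon entropy
```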
The log-likelihood, also called $G$-statistic, is a linear transformation of Information Gain:
$$G\text{-statistic} = 2 \cdot |E| \cdot IG$$
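And a small sanity check of that identity on a made-up split of $|E| = 100$ examples into two children (natural logarithms throughout, so the factor is exactly $2 \cdot |E|$):

```python
import math

def entropy(counts):
    """Shannon entropy (natural log) of a list of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

parent = [60, 40]                 # class counts before the split
children = [[45, 5], [15, 35]]    # class counts in the two child nodes
n = sum(parent)

# Information Gain: parent impurity minus weighted child impurity.
ig = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# G-statistic computed directly from observed vs. expected cell counts.
g = 0.0
for ch in children:
    for j, observed in enumerate(ch):
        expected = sum(ch) * parent[j] / n   # independence assumption
        if observed > 0:
            g += 2.0 * observed * math.log(observed / expected)

print(g, 2 * n * ig)   # the two numbers agree up to floating-point error
```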
Depending on the community (statistics/data mining), people prefer one measure or the other (related question here). They might be pretty much equivalent in the decision tree induction process. Log-likelihood might give higher scores to balanced partitions when there are many classes, though [Technical Note: Some Properties of Splitting Criteria. Breiman 1996].
Gini Gain can be nicer because it doesn't involve logarithms and you can find the closed form of its expected value and variance under a random-split assumption [Alin Dobra, Johannes Gehrke: Bias Correction in Classification Tree Construction. ICML 2001: 90-97]. It is not as easy for Information Gain (if you are interested, see here).
Best Answer
The same as above, from the same page: http://users.utu.fi/attenka/trent.R