CART Likelihood-Ratio Information-Theory – Understanding the Relationship Between GINI Score and Log-Likelihood Ratio

cart · gini · information-theory · kullback-leibler · likelihood-ratio

I am studying classification and regression trees, and one of the measures for the split location is the GINI score.

Now I am used to determining the best split location by finding where the log of the likelihood ratio of the same data under two candidate distributions is zero, meaning that membership in either distribution is equally likely.
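
In symbols, as a rough sketch of what I mean (writing $p_{1}$ and $p_{2}$ for the two candidate densities, my notation):

$$
\Lambda(x) = \log_{b} \frac{p_{1}(x)}{p_{2}(x)}, \qquad \text{split where } \Lambda(x) = 0, \text{ i.e. where } p_{1}(x) = p_{2}(x).
$$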

My intuition says that there must be a connection of some sort, that GINI must have a solid foundation in Shannon's mathematical theory of information, but I don't understand GINI well enough to derive the relationship myself.

Questions:

  • What is the "first principles" derivation of the GINI impurity score as a splitting measure?
  • How does the GINI score relate to the log of the likelihood ratio or to other information-theoretic fundamentals (Shannon entropy, the pdf, and cross-entropy among them)?

References:

Shannon's entropy is described as:

$$
H\left(X\right) = -\sum_{i} P\left(x_{i}\right)\log_{b} P\left(x_{i}\right)
$$

Extending this to the multivariate case we get:

$$
H\left(X,Y\right) = -\sum_{x}\sum_{y} P\left(x,y\right)\log_{b} P\left(x,y\right)
$$

Conditional Entropy is defined as follows:

\begin{align}
H\left(X|Y\right) &= \sum_{x}\sum_{y} p\left(x,y\right)\log_{b} \frac {p\left(y\right)} {p\left(x,y\right)} \newline
&\text{or, equivalently,} \newline
H\left(X|Y\right) &= H\left(X,Y\right) - H\left(Y\right)
\end{align}
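
As a quick numerical sanity check of the identity $H(X|Y) = H(X,Y) - H(Y)$, here is a minimal sketch; the joint distribution `p_xy` is made up purely for illustration:

```python
import numpy as np

# A made-up 2x3 joint distribution P(X, Y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.05],
                 [0.25, 0.15, 0.25]])

def entropy(p, base=2):
    """Shannon entropy -sum p log_b p over the nonzero cells of p."""
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

H_xy = entropy(p_xy)              # joint entropy H(X, Y)
H_y  = entropy(p_xy.sum(axis=0))  # marginal entropy H(Y)

# Conditional entropy directly: sum_{x,y} p(x,y) * log2( p(y) / p(x,y) )
p_y = p_xy.sum(axis=0, keepdims=True)
H_x_given_y = np.sum(p_xy * np.log2(p_y / p_xy))

print(H_x_given_y, H_xy - H_y)    # the two values should agree
```

Both printed values agree: the direct sum and the chain-rule form $H(X,Y) - H(Y)$ are the same quantity.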

The log of the ratio of likelihoods is used for abrupt change detection and is derived using these. (I don't have derivation in front of me.)

GINI Impurity:

  • The general form of GINI impurity is $ I = \sum_{i=1}^m f_{i} \cdot \left( 1-f_{i}\right) $
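
Since the class proportions $f_{i}$ sum to one, this is algebraically identical to the $1 - \sum_i f_i^2$ form that appears in the answer below:

$$
I = \sum_{i=1}^{m} f_{i}\left(1-f_{i}\right) = \sum_{i=1}^{m} f_{i} - \sum_{i=1}^{m} f_{i}^{2} = 1 - \sum_{i=1}^{m} f_{i}^{2}.
$$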

Thoughts:

  • Splitting is done on a measure of impurity. High "purity" is likely the same as low entropy. The approach is likely related to entropy minimization.
  • It is likely that the assumed base distribution is uniform, or, with some hand-waving, Gaussian. They are likely working with a mixture of distributions.
  • I wonder if the Shewhart chart derivation can apply here?
  • The GINI impurity looks like the probability mass function of a binomial distribution with 2 trials and one success, $P\left(x=1\right)= \binom{2}{1} p \left( 1-p \right)$ (see the worked check below).

(additional)

  • The form is also consistent with a beta-binomial distribution, which is a conjugate prior for the hypergeometric distribution. Hypergeometric tests are often used to determine which categories are over- or under-represented in a sample. There is also a relationship to Fisher's exact test, whatever that is (note to self: go learn more about this).
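
A quick worked check of the binomial observation above, for the two-class case with proportions $p$ and $1-p$:

$$
\mathit{Gini} = 1 - p^{2} - \left(1-p\right)^{2} = 2p\left(1-p\right) = \binom{2}{1} p \left(1-p\right) = P\left(k=1 \mid n=2, p\right),
$$

so the two-class GINI impurity is exactly the Binomial$(2,p)$ probability of one success in two trials.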

Edit:
I suspect that there is a form of the GINI that works very well with digital logic and/or rb-trees. I hope to explore this in a class project this fall.

Best Answer

I will use the same notation I used here: Mathematics behind classification and regression trees

Gini Gain and Information Gain ($IG$) are both impurity-based splitting criteria; here $E$ denotes the set of examples in a node, $p_j$ the proportion of class $j$ in $E$, and $c$ the number of classes. The only difference between them is the impurity function $I$:

  1. $\textit{Gini}: \mathit{Gini}(E) = 1 - \sum_{j=1}^{c}p_j^2$
  2. $\textit{Entropy}: H(E) = -\sum_{j=1}^{c}p_j\log p_j$

They are actually particular cases of a more general entropy measure (Tsallis entropy), parametrized by $\beta$:

$$H_\beta (E) = \frac{1}{\beta-1} \left( 1 - \sum_{j=1}^{c}p_j^\beta \right) $$

$\textit{Gini}$ is obtained with $\beta = 2$ and $H$ with $\beta \rightarrow 1$.
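
To make both special cases explicit (using natural logarithms for the $\beta \rightarrow 1$ limit):

$$
H_{2}(E) = \frac{1}{2-1}\left(1 - \sum_{j=1}^{c} p_j^{2}\right) = \mathit{Gini}(E),
\qquad
\lim_{\beta \to 1} H_{\beta}(E) = \lim_{\beta \to 1} \frac{1 - \sum_{j=1}^{c} p_j^{\beta}}{\beta - 1} = -\sum_{j=1}^{c} p_j \ln p_j = H(E),
$$

where the limit follows from L'Hôpital's rule, since $\frac{d}{d\beta} p_j^{\beta} = p_j^{\beta}\ln p_j$.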

The log-likelihood ratio, also called the $G$-statistic, is a linear transformation of Information Gain:

$$G\text{-statistic} = 2 \cdot |E| \cdot IG$$
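
Here $|E|$ is the number of examples in the node and $IG$ is computed with natural logarithms (nats). A minimal numerical sketch of the relationship, with made-up class counts for a hypothetical binary split:

```python
import numpy as np

def gini(p):
    """Gini impurity 1 - sum_j p_j^2 for a vector of class proportions."""
    return 1.0 - np.sum(p ** 2)

def entropy_nats(p):
    """Shannon entropy -sum_j p_j ln p_j (natural log, zero-safe)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Made-up class counts in the parent node and in the two children of a split.
parent = np.array([40.0, 60.0])
left   = np.array([30.0, 10.0])
right  = np.array([10.0, 50.0])

n = parent.sum()
w_left, w_right = left.sum() / n, right.sum() / n

def gain(impurity):
    """Parent impurity minus the weighted impurity of the children."""
    return (impurity(parent / parent.sum())
            - w_left  * impurity(left / left.sum())
            - w_right * impurity(right / right.sum()))

gini_gain = gain(gini)
info_gain = gain(entropy_nats)        # information gain, in nats

# G-statistic as a linear transformation of information gain: G = 2 * |E| * IG
g_statistic = 2.0 * n * info_gain

print(f"Gini gain               = {gini_gain:.4f}")
print(f"Information gain (nats) = {info_gain:.4f}")
print(f"G-statistic             = {g_statistic:.4f}")
```

The G-statistic printed here is (up to floating-point error) the same number the classical $G$-test gives on the $2 \times 2$ contingency table of split side by class.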

Depending on the community (statistics/data mining), people prefer one measure or the other (related question here). They might be pretty much equivalent in the decision tree induction process. The log-likelihood might give higher scores to balanced partitions when there are many classes, though [Technical Note: Some Properties of Splitting Criteria. Breiman 1996].

Gini Gain can be nicer because it doesn't involve logarithms, and you can find closed forms for its expected value and variance under a random-split assumption [Alin Dobra, Johannes Gehrke: Bias Correction in Classification Tree Construction. ICML 2001: 90-97]. It is not as easy for Information Gain (if you are interested, see here).