Solved – Relationship between entropy and information gain

entropy, estimation, information-theory, maximum-entropy, mutual-information

Based on the papers:

  1. D. Erdogmus and J. C. Principe, "An Error-Entropy Minimization Algorithm for Supervised Training of Nonlinear Adaptive Systems."
  2. J. Principe, D. Xu, and J. Fisher, "Information theoretic learning," in Unsupervised Adaptive Filtering, S. Haykin, Ed. New York: Wiley, 2000, vol. I, pp. 265–319.

Entropy (Shannon's and Rényi's) has been used in learning by minimizing the entropy of the error as the objective function instead of the mean squared error. The rationale is that minimizing the entropy of the error amounts to maximizing mutual information.
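To make the objective concrete, here is a minimal sketch (my own, not taken from the papers, with an assumed kernel width `sigma`) of a Parzen-window estimator of Rényi's quadratic entropy of the error that such algorithms minimize:

```python
import numpy as np

def renyi_quadratic_entropy(errors, sigma=1.0):
    """Parzen-window estimate of Renyi's quadratic entropy H2 of the error.

    H2 = -log V, where the "information potential"
    V = (1/N^2) * sum_i sum_j G(e_i - e_j), with G a Gaussian kernel of
    standard deviation sigma*sqrt(2). Minimizing H2 is the same as
    maximizing V, which is what error-entropy minimization does.
    """
    e = np.asarray(errors, dtype=float)
    diffs = e[:, None] - e[None, :]            # all pairwise error differences
    var = 2.0 * sigma ** 2                     # kernel variance doubles under convolution
    gauss = np.exp(-diffs ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    information_potential = gauss.mean()       # (1/N^2) * double sum
    return -np.log(information_potential)

# Concentrated errors give lower entropy than spread-out errors:
rng = np.random.default_rng(0)
print(renyi_quadratic_entropy(0.1 * rng.standard_normal(200)))  # small errors -> low H2
print(renyi_quadratic_entropy(2.0 * rng.standard_normal(200)))  # large errors -> high H2
```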

Now, entropy = disorder = uncertainty: the higher the uncertainty, the higher the entropy. Also, higher entropy means higher information content (this is the sense used in compression), hence a signal with high entropy cannot be compressed much.
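As a toy illustration of "higher uncertainty, higher entropy" (my own example, not from the papers): a uniform distribution over four symbols has the maximal 2 bits of entropy, while a sharply peaked one has much less and is therefore far more compressible.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum_i p_i * log2(p_i), in bits (0*log 0 taken as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: 2.0 bits, incompressible
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # peaked: ~0.24 bits, highly compressible
```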

So, in view of the above, I have two questions. First, is mutual information just another name for information gain? Second, if maximum entropy implies high information content, why do we minimize the entropy of the error between the output signal and the desired signal? Is there any proof showing that minimizing the entropy of the error, when used as a fitness function, means we are getting closer to the true estimate of the unknown parameter?

Best Answer

// Is mutual information just another name for information gain? //

No, but mutual information can be expressed in terms of the KL divergence (i.e., information gain): http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities
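For concreteness, the relation given on that page: mutual information is the KL divergence (the quantity sometimes called information gain) between the joint distribution and the product of the marginals,

$$ I(X;Y) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}. $$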

// If maximum entropy implies high information content, why do we minimize the entropy of the error between the output signal and the desired signal? Is there any proof showing that minimizing the entropy of the error, when used as a fitness function, means we are getting closer to the true estimate of the unknown parameter? //

I am not sure I fully understand the question, but there are proofs that minimizing the KL divergence is the only inference process that satisfies certain axioms one would deem reasonable for uncertain reasoning. I suggest you read The Uncertain Reasoner's Companion by Jeff Paris.

The KL divergence and entropy have both been shown to be the only measures of information (relative and absolute, respectively) that satisfy three axioms one would reasonably expect. Arthur Hobson proved this for the KL divergence in Concepts in Statistical Mechanics (a very expensive book), and Shannon proved it for entropy (the proof can be found in many information-theory textbooks).
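For reference, Shannon's characterization (stated roughly, from his 1948 paper) requires that H be continuous in the probabilities, that H increase with the number of equally likely outcomes, and that H decompose over successive choices, e.g.

$$ H\!\left(\tfrac{1}{2},\tfrac{1}{3},\tfrac{1}{6}\right) = H\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{1}{2}\,H\!\left(\tfrac{2}{3},\tfrac{1}{3}\right); $$

the only functional satisfying these (up to a positive constant $K$) is $H = -K\sum_i p_i \log p_i$.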

The similarity between these axioms and their proofs should help you understand the similarity in meaning between the two quantities.

I believe it is this strong mathematical and philosophical foundation that explains why information-theoretic approaches perform so well and generalize like no other.
