Statistical references often present the Kullback-Leibler and $f$-divergences as being information-theoretic in nature. Some examples:
- The paper *On Information and Sufficiency* by Kullback and Leibler (1951), which introduced the KL divergence in mathematical statistics.
- The paper *Importance Sampling and Necessary Sample Size: An Information Theory Approach* (Sanz-Alonso, 2016), which lower-bounds the sample size necessary for importance sampling in terms of $f$-divergences (recalled after this list for reference). The author says that "the bound is deduced from a new and simple information theory paradigm for the study of importance sampling …".
- And *Information-theoretic Characterization of Bayes Performance and […]* (Barron, 1986), which uses the KL divergence to study different choices of priors and their posterior asymptotics.
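For reference, writing $p$ and $q$ for the densities of $P$ and $Q$ with respect to a common dominating measure $\mu$, the quantities involved are (my notation, not the papers'):

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p \log\frac{p}{q} \, d\mu, \qquad D_f(P \,\|\, Q) = \int q\, f\!\left(\frac{p}{q}\right) d\mu,$$

where $f$ is convex with $f(1) = 0$; the KL divergence is the $f$-divergence with $f(t) = t \log t$.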
Yet my understanding of the KL divergence is entirely statistical: it is an expected weight of evidence, it is invariant under sufficient statistics, it, like the other $f$-divergences, is tied to the asymptotics of the likelihood ratio, and so on.
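To spell out the first two claims in my notation: with $W(x) = \log\frac{p(x)}{q(x)}$ the weight of evidence for $P$ against $Q$ provided by an observation $x$,

$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_P[W(X)],$$

and if $T$ is sufficient for the pair $\{P, Q\}$, then $D_{\mathrm{KL}}(P \,\|\, Q) = D_{\mathrm{KL}}(P^T \,\|\, Q^T)$, where $P^T$ and $Q^T$ denote the induced distributions of $T(X)$.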
In what sense is the Kullback-Leibler divergence information-theoretic in nature? How are statistical arguments based on the Kullback-Leibler divergence of an information-theoretic flavor?
Notes: I'm looking for a precise answer that would rigorously justify the use of "information theory" in the titles above, or argue that it is only a buzzword. I'm considering general infinite sample spaces.
See also: Kullback-Leibler divergence WITHOUT information theory.
Best Answer
This answer attempts to explain (1) the information-theoretic interpretation of the KL divergence and (2) how that interpretation lends itself to Bayesian analysis. What follows is quoted directly from pp. 148-150, Section 6.6, of Stone's *Information Theory*, a very good book which I recommend.
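For concreteness, here is a minimal numerical sketch of that interpretation (my own illustration, not Stone's code): $D_{\mathrm{KL}}(p \,\|\, q)$, in bits, is the expected excess code length paid when a source distributed as $p$ is compressed with a code optimized for $q$.

```python
import numpy as np

# Toy discrete source: p is the true distribution, q is the distribution
# the code was (mistakenly) optimized for.
p = np.array([0.5, 0.25, 0.25])
q = np.array([0.8, 0.1, 0.1])

entropy = -np.sum(p * np.log2(p))        # H(p): optimal bits per symbol
cross_entropy = -np.sum(p * np.log2(q))  # H(p, q): bits per symbol using q's code
kl = np.sum(p * np.log2(p / q))          # D_KL(p || q), in bits

print(f"H(p)     = {entropy:.4f} bits/symbol")
print(f"H(p, q)  = {cross_entropy:.4f} bits/symbol")
print(f"excess   = {cross_entropy - entropy:.4f} bits/symbol")
print(f"KL(p||q) = {kl:.4f} bits/symbol")
```

The excess is exactly $D_{\mathrm{KL}}(p \,\|\, q) \approx 0.322$ bits per symbol here, since $H(p, q) - H(p) = \sum_x p(x)\log_2\bigl(p(x)/q(x)\bigr)$.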
The application to Bayesian analysis can be found in Appendix H, pp. 157-158, of Stone's other very good book, *Bayes' Rule*.
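For readers without the book, the connection is roughly the following (my gloss, not a quotation from Stone): by Bayes' rule, for two hypotheses $H_1, H_0$ with likelihoods $p_1, p_0$,

$$\log\frac{P(H_1 \mid x)}{P(H_0 \mid x)} = \log\frac{P(H_1)}{P(H_0)} + \log\frac{p_1(x)}{p_0(x)},$$

so each observation moves the posterior log-odds by exactly the weight of evidence, and the expected movement per observation under $H_1$ is $D_{\mathrm{KL}}(p_1 \,\|\, p_0)$.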