Why don't statisticians use mutual information as a measure of association?

correlation, mutual-information

I've seen a couple of talks by non-statisticians in which they seem to reinvent correlation measures using mutual information rather than regression (or equivalent/closely related statistical tests).

I take it there's a good reason statisticians don't take this approach. My layman's understanding is that estimators of entropy / mutual information tend to be problematic and unstable. I assume power is also problematic as a result; the speakers try to get around this by claiming that they're not using a parametric testing framework. Usually this kind of work doesn't bother with power calculations, or even confidence/credible intervals.

But to take a devil's advocate position, is slow convergence that big of a deal when datasets are extremely large? Also, sometimes these methods seem to "work" in the sense that the associations are validated by follow-up studies. What's the best critique against using mutual information as a measure of association and why isn't it widely used in statistical practice?

edit: Also, are there any good papers that cover these issues?

Best Answer

I think you should distinguish between categorical (discrete) data and continuous data.

For continuous data, Pearson correlation measures the strength of a linear relationship, while rank correlation measures the strength of a monotonic relationship (of which a linear one is a special case).
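To make that distinction concrete, here is a minimal sketch on made-up data (not part of the original answer). It uses a perfectly monotonic but nonlinear relationship, y = exp(x). The helpers `pearson` and `ranks` are hypothetical names written for this illustration, and the rank correlation is computed as the Pearson correlation of the ranks, which is valid here because there are no ties.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

# Made-up data: y is a strictly increasing, strongly nonlinear function of x.
xs = [i / 10 for i in range(50)]
ys = [math.exp(x) for x in xs]

r_pearson = pearson(xs, ys)
r_spearman = pearson(ranks(xs), ranks(ys))  # rank (Spearman) correlation

print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

The rank correlation is 1 up to rounding because the ordering of y matches the ordering of x exactly, while the Pearson coefficient is noticeably lower because the relationship is not a straight line.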

MI, on the other hand, "detects" any kind of dependence. That is usually not what you are interested in, and much of what it picks up is likely to be noise. In particular, you have to estimate the joint density before you can compute it. Because the data are continuous, you would typically first build a histogram (discrete bins) and then calculate MI from the bin counts. But since MI allows for any relationship, the estimate changes as you use smaller bins (i.e. as you allow more wiggles). So the estimate of MI can be very unstable, which makes it hard to put trustworthy confidence intervals on it. [The same goes for a continuous density estimate, where the bandwidth plays the role of the bin width.] Basically, there are too many things to estimate before you can actually calculate the MI.
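The dependence on bin width can be seen in a small simulation. The following is a rough sketch, not a recommended estimator: it draws two independent Gaussian samples (so the true MI is zero) and computes the plug-in MI from equal-width histograms with progressively finer bins. The function name `binned_mi`, the sample size, and the bin counts are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def binned_mi(xs, ys, bins):
    """Plug-in MI estimate (in nats) from an equal-width bins-by-bins histogram."""
    def to_bin(v, lo, hi):
        t = (v - lo) / (hi - lo)             # position in [0, 1]
        return min(int(t * bins), bins - 1)  # the maximum goes into the last bin

    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    cells = [(to_bin(x, x_lo, x_hi), to_bin(y, y_lo, y_hi))
             for x, y in zip(xs, ys)]
    n = len(cells)
    joint = Counter(cells)                   # joint bin counts
    px = Counter(i for i, _ in cells)        # marginal counts for x
    py = Counter(j for _, j in cells)        # marginal counts for y
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

rng = random.Random(0)
n = 500
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [rng.gauss(0, 1) for _ in range(n)]     # independent of xs: true MI is 0

bin_counts = [4, 8, 16, 32]
estimates = [binned_mi(xs, ys, b) for b in bin_counts]
for b, mi in zip(bin_counts, estimates):
    print(f"{b:>2} bins per axis: MI estimate = {mi:.3f} nats")
```

Because each finer grid here is a refinement of the coarser one, the plug-in estimate cannot decrease as the bins shrink, even though the variables are independent. The number you report is therefore partly a function of the binning choice rather than of the data alone.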

Categorical data, on the other hand, fits quite nicely into the MI framework (see the G-test), and in practice there is not much to choose between the G-test and Pearson's chi-squared test.
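The link to the G-test is direct: for a contingency table with N observations, the G statistic equals 2N times the plug-in mutual information measured in nats. A minimal sketch on a made-up 2x3 table of counts (the numbers are purely illustrative) computes both, along with Pearson's chi-squared statistic for comparison:

```python
import math

# Made-up contingency table of counts (rows: one variable, columns: the other).
table = [[30, 10, 20],
         [15, 25, 20]]

n = sum(sum(r) for r in table)
row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]

mi = 0.0    # plug-in mutual information, in nats
g = 0.0     # G-test statistic: 2 * sum(O * ln(O / E))
chi2 = 0.0  # Pearson chi-squared statistic: sum((O - E)^2 / E)
for i, r in enumerate(table):
    for j, observed in enumerate(r):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected
        if observed > 0:  # empty cells contribute nothing to MI or G
            mi += (observed / n) * math.log(observed / expected)
            g += 2 * observed * math.log(observed / expected)

print(f"MI (nats):   {mi:.4f}")
print(f"2 * N * MI:  {2 * n * mi:.4f}")
print(f"G statistic: {g:.4f}")
print(f"Chi-squared: {chi2:.4f}")
```

Under the null hypothesis of independence, both G and the chi-squared statistic are referred to the same chi-squared distribution with (rows - 1) x (columns - 1) degrees of freedom, which is why the two tests usually lead to the same conclusion.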