I think you should distinguish between categorical (discrete) data and continuous data.
For continuous data, Pearson correlation measures a linear relationship; rank correlation measures a monotonic relationship.
MI, on the other hand, "detects" any kind of relationship. This is normally not what you are interested in, and/or whatever extra it picks up is likely to be noise.
In particular, you have to estimate the joint density of the distribution. Since the data are continuous, you would first create a histogram [discrete bins] and then calculate MI on the binned data. But since MI allows for any relationship, the estimate will change as you use smaller bins (i.e. as you allow more wiggles). So you can see that the estimation of MI will be very unstable, not allowing you to put any confidence intervals on the estimate etc. [The same goes if you use a continuous density estimate.] Basically, there are too many things to estimate before you actually calculate the MI.
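To see this concretely, here is a minimal sketch of the plug-in histogram estimator (the sample size and bin counts are arbitrary). The two samples are independent, so the true MI is exactly zero, yet the estimate drifts upward as the bins shrink:

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_from_histogram(x, y, bins):
    """Plug-in MI estimate (in nats) from a 2-D histogram of (x, y)."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()                  # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)          # marginal of x
    py = pxy.sum(axis=0, keepdims=True)          # marginal of y
    nz = pxy > 0                                 # avoid log(0)
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

# Independent samples: the true MI is exactly 0 ...
x = rng.normal(size=1000)
y = rng.normal(size=1000)

# ... yet the plug-in estimate grows steadily as the bins get smaller.
for bins in (5, 10, 20, 50, 100):
    print(f"bins={bins:4d}  MI estimate={mi_from_histogram(x, y, bins):.3f}")
```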
Categorical data, on the other hand, fits quite nicely into the MI framework (see the G-test), and there is not much to choose between the G-test and chi-squared.
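For instance, SciPy's `chi2_contingency` computes both statistics from the same contingency table (the G statistic is just $2n$ times the plug-in MI in nats); a quick sketch with a made-up table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# A hypothetical 2x3 contingency table of two categorical variables.
table = np.array([[30, 10, 5],
                  [12, 25, 8]])

chi2, p_chi2, dof, _ = chi2_contingency(table)
g, p_g, _, _ = chi2_contingency(table, lambda_="log-likelihood")  # G-test
print(f"chi-squared = {chi2:.2f} (p = {p_chi2:.3f}, dof = {dof})")
print(f"G statistic = {g:.2f} (p = {p_g:.3f})")
```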
There is a simpler and better way to deal with this problem. A categorical variable is effectively just a set of indicator variables. It is a basic idea of measurement theory that such a variable is invariant to relabelling of the categories, so it does not make sense to use the numerical labelling of the categories in any measure of its relationship with another variable (e.g., 'correlation'). For this reason, any measure of the relationship between a continuous variable and a categorical variable should be based entirely on the indicator variables derived from the latter.
Given that you want a measure of 'correlation' between the two variables, it makes sense to look at the correlation between a continuous random variable $X$ and an indicator random variable $I$ derived from the categorical variable. Letting $\phi \equiv \mathbb{P}(I=1)$ we have:
$$\mathbb{Cov}(I,X) = \mathbb{E}(IX) - \mathbb{E}(I) \mathbb{E}(X) = \phi \left[ \mathbb{E}(X|I=1) - \mathbb{E}(X) \right] ,$$
which gives:
$$\mathbb{Corr}(I,X) = \sqrt{\frac{\phi}{1-\phi}} \cdot \frac{\mathbb{E}(X|I=1) - \mathbb{E}(X)}{\mathbb{S}(X)} .$$
So the correlation between a continuous random variable $X$ and an indicator random variable $I$ is a fairly simple function of the indicator probability $\phi$ and the standardised gain in expected value of $X$ from conditioning on $I=1$. Note that this correlation does not require any discretization of the continuous random variable.
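As a quick numerical sanity check, this formula with the plug-in $1/n$ moments reproduces the ordinary sample Pearson correlation between $I$ and $X$ exactly (a minimal sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
i = rng.random(n) < 0.3            # indicator with phi = 0.3
x = rng.normal(size=n) + 2.0 * i   # X shifts upward when I = 1

phi = i.mean()
formula = np.sqrt(phi / (1 - phi)) * (x[i].mean() - x.mean()) / x.std()
direct = np.corrcoef(i, x)[0, 1]   # ordinary Pearson correlation
print(formula, direct)             # identical up to floating-point error
```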
For a general categorical variable $C$ with range $1, ..., m$ you would then just extend this idea to have a vector of correlation values for each outcome of the categorical variable. For any outcome $C=k$ we can define the corresponding indicator $I_k \equiv \mathbb{I}(C=k)$ and we have:
$$\mathbb{Corr}(I_k,X) = \sqrt{\frac{\phi_k}{1-\phi_k}} \cdot \frac{\mathbb{E}(X|C=k) - \mathbb{E}(X)}{\mathbb{S}(X)} .$$
We can then define $\mathbb{Corr}(C,X) \equiv (\mathbb{Corr}(I_1,X), ..., \mathbb{Corr}(I_m,X))$ as the vector of correlation values for each category of the categorical random variable. This is really the only sense in which it makes sense to talk about 'correlation' for a categorical random variable.
(Note: Since $\sum_k I_k = 1$, it is trivial to show that $\sum_k \mathbb{Cov}(I_k,X) = \mathbb{Cov}(1,X) = 0$, so the correlation vector for a categorical random variable is subject to this constraint. This means that, given knowledge of the probability vector for the categorical random variable and the standard deviation of $X$, you can derive the vector from any $m-1$ of its elements.)
The above exposition is for the true correlation values, but obviously these must be estimated in a given analysis. Estimating the indicator correlations from sample data is simple, and can be done by substitution of appropriate estimates for each of the parts. (You could use fancier estimation methods if you prefer.) Given sample data $(x_1, c_1), ..., (x_n, c_n)$ we can estimate the parts of the correlation equation as:
$$\hat{\phi}_k \equiv \frac{1}{n} \sum_{i=1}^n \mathbb{I}(c_i=k).$$
$$\hat{\mathbb{E}}(X) \equiv \bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i.$$
$$\hat{\mathbb{E}}(X|C=k) \equiv \bar{x}_k \equiv \frac{1}{n} \sum_{i=1}^n x_i \mathbb{I}(c_i=k) \Bigg/ \hat{\phi}_k .$$
$$\hat{\mathbb{S}}(X) \equiv s_X \equiv \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}.$$
Substitution of these estimates would yield a basic estimate of the correlation vector. If you have parametric information on $X$ then you could estimate the correlation vector directly by maximum likelihood or some other technique.
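Here is a minimal sketch of this plug-in estimator (the data, category labels and shifts are made up for illustration). It also confirms the covariance constraint from the note above:

```python
import numpy as np

def categorical_correlation_vector(x, c):
    """Estimate Corr(I_k, X) for each category k via the plug-in formulas above."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c)
    x_bar = x.mean()
    s_x = x.std(ddof=1)                 # the n-1 estimator from the text
    corr = {}
    for k in np.unique(c):
        mask = c == k
        phi_k = mask.mean()             # estimate of phi_k
        x_bar_k = x[mask].mean()        # estimate of E(X | C = k)
        corr[k] = np.sqrt(phi_k / (1 - phi_k)) * (x_bar_k - x_bar) / s_x
    return corr

# Hypothetical data: X shifts by category, so each indicator correlation
# reflects the size and direction of the shift.
rng = np.random.default_rng(2)
c = rng.choice(["a", "b", "c"], size=10_000, p=[0.5, 0.3, 0.2])
shifts = {"a": 0.0, "b": 1.0, "c": -1.0}
x = rng.normal(size=10_000) + np.vectorize(shifts.get)(c)
corrs = categorical_correlation_vector(x, c)
print(corrs)

# Check the constraint noted earlier: the implied covariances sum to zero.
print(sum(corrs[k] * np.sqrt((c == k).mean() * (1 - (c == k).mean()))
          for k in corrs))             # ~0 up to floating-point error
```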
Information / mutual information does not depend on the actual values taken, only on their probabilities, and it is therefore less sensitive. Distance correlation is more powerful and simpler to compute. For a comparison see
http://www-stat.stanford.edu/~tibs/reshef/comment.pdf
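As an aside, the (biased) sample distance correlation of Székely, Rizzo and Bakirov is straightforward to compute directly; a minimal numpy sketch:

```python
import numpy as np

def distance_correlation(x, y):
    """Biased sample distance correlation of two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distances within x
    b = np.abs(y[:, None] - y[None, :])   # pairwise distances within y
    # Double-centre each distance matrix.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(3)
x = rng.normal(size=500)
print(distance_correlation(x, x ** 2))                 # detects a nonlinear relationship
print(distance_correlation(x, rng.normal(size=500)))   # near zero under independence
```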