Solved – Using mutual information to estimate correlation between a continuous variable and a categorical variable

correlation, information theory, mutual information

As the title says, the idea is to use mutual information (hereafter MI) to estimate the "correlation" (defined as "how much I know about A when I know B") between a continuous variable and a categorical variable. I will tell you my thoughts on the matter in a moment, but before that I advise you to read this other question/answer on CrossValidated, as it contains some useful information.

Now, because we cannot integrate over a categorical variable, we need to discretize the continuous one. This can be done quite easily in R, the language I have done most of my analyses with. I preferred the cut function, since it also labels the resulting bins, but other options are available as well. The point is, one has to decide the number of "bins" (discrete states) a priori, before any discretization can be done.
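For illustration, a minimal sketch of that discretization step (the data and the bin count here are arbitrary):

```r
# Equal-width discretization of a continuous variable with cut().
# The number of bins has to be fixed in advance.
set.seed(1)
x <- rnorm(100)                    # illustrative continuous variable
n_bins <- 5
x_disc <- cut(x, breaks = n_bins)  # factor whose levels label the bin intervals
table(x_disc)                      # counts per bin
```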

The main problem, however, lies elsewhere: MI ranges from 0 to ∞, as it is an unstandardised measure whose unit is the bit (or the nat, depending on the base of the logarithm). That makes it very difficult to use as a correlation coefficient. This can be partly solved by using the global correlation coefficient (hereafter GCC), a standardised version of MI. GCC is defined as follows:

$$\text{GCC}(X,Y) = \sqrt{1 - e^{-2\, I(X,Y)}},$$

where $I(X,Y)$ is the mutual information between $X$ and $Y$, measured in nats.

Reference: the formula is from Mutual Information as a Nonlinear Tool for Analyzing Stock Market Globalization by Andreia Dionísio, Rui Menezes & Diana Mendes, 2010.

GCC ranges from 0 to 1, and therefore can easily be used to estimate the correlation between two variables. Problem solved, right? Well, kind of, because the whole process depends heavily on the number of 'bins' chosen during the discretization. Here are the results of my experiments:

[Figure: GCC as a function of the number of bins, for two datasets.]

On the y-axis you have GCC, and on the x-axis the number of 'bins' used for discretization. The two lines refer to two different analyses I conducted on two different (although very similar) datasets.
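For reference, a computation of this kind can be sketched in R as follows; `mi_from_table` and `gcc` are my own helper names, not package functions, and MI is computed in nats so that the GCC formula above applies directly:

```r
# Mutual information (in nats) from a two-way contingency table.
mi_from_table <- function(tab) {
  p  <- tab / sum(tab)                 # joint probabilities
  pe <- outer(rowSums(p), colSums(p))  # product of the marginals
  nz <- p > 0                          # skip empty cells to avoid log(0)
  sum(p[nz] * log(p[nz] / pe[nz]))
}

# Global correlation coefficient from MI.
gcc <- function(mi) sqrt(1 - exp(-2 * mi))

# GCC between continuous x and categorical g, for a range of bin counts.
set.seed(1)
g <- factor(sample(c("a", "b", "c"), 200, replace = TRUE))
x <- rnorm(200, mean = as.numeric(g))   # x shifts with the category
sapply(2:20, function(b) gcc(mi_from_table(table(cut(x, b), g))))
```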

It seems to me that the use of MI in general, and of GCC in particular, is still controversial. Then again, this confusion may be the result of a mistake on my side. Either way, I'd love to hear your opinion on the matter (also: do you have alternative methods to estimate the correlation between a categorical variable and a continuous one?).

Best Answer

There is a simpler and better way to deal with this problem. A categorical variable is effectively just a set of indicator variables. It is a basic idea of measurement theory that such a variable is invariant to relabelling of the categories, so it does not make sense to use the numerical labelling of the categories in any measure of the relationship with another variable (e.g., 'correlation'). For this reason, any measure of the relationship between a continuous variable and a categorical variable should be based entirely on the indicator variables derived from the latter.

Given that you want a measure of 'correlation' between the two variables, it makes sense to look at the correlation between a continuous random variable $X$ and an indicator random variable $I$ derived from the categorical variable. Letting $\phi \equiv \mathbb{P}(I=1)$ we have:

$$\mathbb{Cov}(I,X) = \mathbb{E}(IX) - \mathbb{E}(I) \mathbb{E}(X) = \phi \left[ \mathbb{E}(X|I=1) - \mathbb{E}(X) \right] ,$$

which, since $\mathbb{V}(I) = \phi(1-\phi)$, gives:

$$\mathbb{Corr}(I,X) = \sqrt{\frac{\phi}{1-\phi}} \cdot \frac{\mathbb{E}(X|I=1) - \mathbb{E}(X)}{\mathbb{S}(X)} .$$

So the correlation between a continuous random variable $X$ and an indicator random variable $I$ is a fairly simple function of the indicator probability $\phi$ and the standardised gain in expected value of $X$ from conditioning on $I=1$. Note that this correlation does not require any discretization of the continuous random variable.
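As a quick numerical check of this (the data are simulated, and the comparison is against base R's `cor`):

```r
# Correlation between X and an indicator is just a Pearson correlation
# with the 0/1 variable -- no discretization of X is involved.
set.seed(1)
n <- 10000
I <- rbinom(n, 1, 0.3)        # indicator with phi = 0.3
x <- rnorm(n, mean = 2 * I)   # X has a higher mean when I = 1

cor(x, I)                     # sample Pearson correlation

# The formula above, with plug-in estimates; it agrees with cor(x, I)
# up to a factor of sqrt(n / (n - 1)).
phi <- mean(I)
sqrt(phi / (1 - phi)) * (mean(x[I == 1]) - mean(x)) / sd(x)
```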


For a general categorical variable $C$ with range $1, ..., m$ you would then just extend this idea to have a vector of correlation values for each outcome of the categorical variable. For any outcome $C=k$ we can define the corresponding indicator $I_k \equiv \mathbb{I}(C=k)$ and we have:

$$\mathbb{Corr}(I_k,X) = \sqrt{\frac{\phi_k}{1-\phi_k}} \cdot \frac{\mathbb{E}(X|C=k) - \mathbb{E}(X)}{\mathbb{S}(X)} .$$

We can then define $\mathbb{Corr}(C,X) \equiv (\mathbb{Corr}(I_1,X), ..., \mathbb{Corr}(I_m,X))$ as the vector of correlation values for each category of the categorical random variable. This is really the only sense in which it makes sense to talk about 'correlation' for a categorical random variable.

(Note: since $\sum_k I_k = 1$, we have $\sum_k \mathbb{Cov}(I_k,X) = \mathbb{Cov}(1,X) = 0$, so the correlation vector for a categorical random variable is subject to this constraint. This means that, given knowledge of the probability vector of the categorical random variable and the standard deviation of $X$, you can derive the vector from any $m-1$ of its elements.)


The above exposition is for the true correlation values, but obviously these must be estimated in a given analysis. Estimating the indicator correlations from sample data is simple, and can be done by substitution of appropriate estimates for each of the parts. (You could use fancier estimation methods if you prefer.) Given sample data $(x_1, c_1), ..., (x_n, c_n)$ we can estimate the parts of the correlation equation as:

$$\hat{\phi}_k \equiv \frac{1}{n} \sum_{i=1}^n \mathbb{I}(c_i=k).$$

$$\hat{\mathbb{E}}(X) \equiv \bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i.$$

$$\hat{\mathbb{E}}(X|C=k) \equiv \bar{x}_k \equiv \frac{1}{n} \sum_{i=1}^n x_i \mathbb{I}(c_i=k) \Bigg/ \hat{\phi}_k .$$

$$\hat{\mathbb{S}}(X) \equiv s_X \equiv \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}.$$

Substitution of these estimates would yield a basic estimate of the correlation vector. If you have parametric information on $X$ then you could estimate the correlation vector directly by maximum likelihood or some other technique.
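Putting the pieces together, a minimal R sketch of this plug-in estimator (the function name is mine):

```r
# Plug-in estimate of the correlation vector Corr(C, X).
corr_vector <- function(x, c) {
  c     <- factor(c)
  phi   <- as.numeric(table(c)) / length(x)  # estimated P(C = k)
  xbar  <- mean(x)
  xbark <- tapply(x, c, mean)                # estimated E(X | C = k)
  s_x   <- sd(x)                             # n - 1 denominator, as above
  sqrt(phi / (1 - phi)) * (xbark - xbar) / s_x
}

# Example on simulated data.
set.seed(1)
c <- factor(sample(letters[1:3], 500, replace = TRUE))
x <- rnorm(500, mean = as.numeric(c))
corr_vector(x, c)

# Sanity check of the constraint noted above: the estimated covariances
# sum to zero (up to floating-point error), because the phi-weighted
# conditional means average back to the overall mean.
phi <- as.numeric(table(c)) / length(x)
sum(phi * (tapply(x, c, mean) - mean(x)))
```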