Correlation – Intuitive Explanation of the MIC Algorithm for Detecting Non-Linear Correlations

bioinformatics, correlation, information theory, mutual information, nonparametric

I recently read two articles: Speed's, on the history of correlation, and Reshef et al.'s, which introduces a new method called the maximal information coefficient (MIC). I would like help understanding how the MIC method estimates non-linear correlations between variables.

Instructions for using MIC in R can be found on the authors' website (under Downloads).

I hope this will be a good place to discuss and understand this method. My interest is in the intuition behind it and in how it can be extended, as Speed (2011) put it:

…we need extensions of $\text{MIC}(X,Y)$ to $\text{MIC}(X,Y|Z)$. We will want to know how much data are needed to get stable estimates of MIC, how susceptible it is to outliers, what three- or higher-dimensional relationships it will miss, and more. MIC is a great step forward, but there are many more steps to take.

Citations
Speed, T. (2011). A Correlation for the 21st Century. Science, 334(6062), 1502–1503.

Reshef, D. N., et al. (2011). Detecting Novel Associations in Large Data Sets. Science, 334(6062), 1518–1524.

Best Answer

Is it not telling that this was published in a non-statistical journal, whose statistical peer review we cannot be sure of? This problem was solved by Hoeffding in 1948 (Annals of Mathematical Statistics, 19:546), who developed a straightforward algorithm requiring neither binning nor multiple steps. His method has been available for many years as the hoeffd function in the R Hmisc package. Here's an example (type example(hoeffd) in R):

# Hoeffding's test can detect even one-to-many dependency
library(Hmisc)                      # provides hoeffd()
set.seed(1)
x <- seq(-10, 10, length = 200)
y <- x * sign(runif(200, -1, 1))    # y = +/- x at random, tracing an X
plot(x, y)                          # an X
hoeffd(x, y)                        # also accepts a numeric matrix

D
     x    y
x 1.00 0.06
y 0.06 1.00

n= 200 

P
  x  y 
x     0   # P-value is very small
y  0   
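
For contrast (my addition, not part of the original example), an ordinary linear correlation misses this X-shaped dependence entirely; by the symmetry of the construction, the Pearson correlation should be essentially zero:

cor(x, y)       # near zero: the X pattern has no linear trend
cor.test(x, y)  # no significant linear association expected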

hoeffd uses a fairly efficient Fortran implementation of Hoeffding's method. The basic idea of his test is to consider the difference between the joint ranks of $(X, Y)$ and the product of the marginal ranks of $X$ and $Y$, suitably scaled.
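
To make that idea concrete, here is a minimal R sketch of Hoeffding's $D$ for tie-free data, following the classical rank-based formula (scaled by 30, as in Hmisc, so the maximum is 1). The function hoeffding_D is my own illustration, not the Fortran routine hoeffd actually calls, and it ignores the tie corrections hoeffd applies:

hoeffding_D <- function(x, y) {
  n <- length(x)
  R <- rank(x)                      # marginal ranks of x
  S <- rank(y)                      # marginal ranks of y
  # Q_i: 1 + number of points strictly below point i in both coordinates
  Q <- sapply(seq_len(n), function(i) 1 + sum(x < x[i] & y < y[i]))
  D1 <- sum((Q - 1) * (Q - 2))
  D2 <- sum((R - 1) * (R - 2) * (S - 1) * (S - 2))
  D3 <- sum((R - 2) * (S - 2) * (Q - 1))
  30 * ((n - 2) * (n - 3) * D1 + D2 - 2 * (n - 2) * D3) /
    (n * (n - 1) * (n - 2) * (n - 3) * (n - 4))
}

u <- rnorm(50); v <- u^2            # tie-free data with non-linear dependence
hoeffding_D(u, v)                   # should match hoeffd(u, v)$D[1, 2] absent ties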

Update

I have since been corresponding with the authors (who are very nice, by the way, and are open to other ideas and continuing to research their methods). They originally had the Hoeffding reference in their manuscript but cut it (with regrets, now) for lack of space. While Hoeffding's $D$ test seems to perform well for detecting dependence in their examples, it does not provide an index that meets their criterion of ordering degrees of dependence the way the human eye is able to.

In an upcoming release of the R Hmisc package I've added two additional outputs related to $D$, namely the mean and max $|F(x,y) - G(x)H(y)|$, which are useful measures of dependence. However, these measures, like $D$, do not have the property that the creators of MIC were seeking.
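
As a rough sketch of what such measures look like, here is one way to compute them from plain empirical CDFs evaluated at the observed points, reusing the x and y from the example above (my own construction from the definition; the exact code in Hmisc, including its tie handling, may differ):

F_xy <- sapply(seq_along(x), function(i) mean(x <= x[i] & y <= y[i]))
G_x  <- ecdf(x)(x)                  # marginal empirical CDF of x
H_y  <- ecdf(y)(y)                  # marginal empirical CDF of y
dep  <- abs(F_xy - G_x * H_y)       # pointwise departure from independence
mean(dep)                           # mean |F(x,y) - G(x)H(y)|
max(dep)                            # max  |F(x,y) - G(x)H(y)|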
