Correlation – Intuitive Explanation of the MIC Algorithm for Detecting Non-Linear Correlations

bioinformatics, correlation, information theory, mutual information, nonparametric

I recently read two articles: Speed's, on the history of correlation, and Reshef et al.'s, which introduces a new method called the maximal information coefficient (MIC). I would like help understanding how the MIC method estimates non-linear correlations between variables.

Instructions for using MIC in R can be found on the authors' website (under Downloads).

I hope this will be a good place to discuss and understand this method. My interest is in the intuition behind it and in how it can be extended, as Speed (2011) put it:

…we need extensions of $\text{MIC}(X,Y)$ to $\text{MIC}(X,Y|Z)$. We will want to know how much data are needed to get stable estimates of MIC, how susceptible it is to outliers, what three- or higher-dimensional relationships it will miss, and more. MIC is a great step forward, but there are many more steps to take.

Citations
Speed, T. (2011). A Correlation for the 21st Century. Science, 334(6062), 1502–1503.

Reshef, D. N., et al. (2011). Detecting Novel Associations in Large Data Sets. Science, 334(6062), 1518–1524.

Best Answer

Is it not telling that this was published in a non-statistical journal, whose statistical peer review we cannot be sure of? This problem was solved by Hoeffding in 1948 (Annals of Mathematical Statistics, 19:546), who developed a straightforward algorithm requiring neither binning nor multiple steps. His method has been available for many years as the hoeffd function in the R Hmisc package. Here's an example (type example(hoeffd) in R):

# Hoeffding's test can detect even one-to-many dependency
library(Hmisc)                      # provides hoeffd()
set.seed(1)
x <- seq(-10, 10, length = 200)
y <- x * sign(runif(200, -1, 1))    # y = +/- x at random, tracing an X
plot(x, y)                          # an X
hoeffd(x, y)                        # also accepts a numeric matrix

D
     x    y
x 1.00 0.06
y 0.06 1.00

n= 200 

P
  x  y 
x     0   # P-value is very small
y  0   
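
For contrast (my addition, not part of the original example), an ordinary linear correlation misses this X-shaped dependence entirely; by the symmetry of the construction, the Pearson correlation should be essentially zero:

cor(x, y)       # near zero: the X pattern has no linear trend
cor.test(x, y)  # no significant linear association expected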

hoeffd uses a fairly efficient Fortran implementation of Hoeffding's method. The basic idea of his test is to consider the difference between the joint ranks of $(X, Y)$ and the product of the marginal ranks of $X$ and $Y$, suitably scaled.
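
To make that idea concrete, here is a minimal R sketch of Hoeffding's $D$ for tie-free data, following the classical rank-based formula (scaled by 30, as in Hmisc, so the maximum is 1). The function hoeffding_D is my own illustration, not the Fortran routine hoeffd actually calls, and it ignores the tie corrections hoeffd applies:

hoeffding_D <- function(x, y) {
  n <- length(x)
  R <- rank(x)                      # marginal ranks of x
  S <- rank(y)                      # marginal ranks of y
  # Q_i: 1 + number of points strictly below point i in both coordinates
  Q <- sapply(seq_len(n), function(i) 1 + sum(x < x[i] & y < y[i]))
  D1 <- sum((Q - 1) * (Q - 2))
  D2 <- sum((R - 1) * (R - 2) * (S - 1) * (S - 2))
  D3 <- sum((R - 2) * (S - 2) * (Q - 1))
  30 * ((n - 2) * (n - 3) * D1 + D2 - 2 * (n - 2) * D3) /
    (n * (n - 1) * (n - 2) * (n - 3) * (n - 4))
}

u <- rnorm(50); v <- u^2            # tie-free data with non-linear dependence
hoeffding_D(u, v)                   # should match hoeffd(u, v)$D[1, 2] absent ties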

Update

I have since been corresponding with the authors (who are very nice, by the way, and are open to other ideas and continuing to research their methods). They originally had the Hoeffding reference in their manuscript but cut it (with regrets, now) for lack of space. While Hoeffding's $D$ test seems to perform well for detecting dependence in their examples, it does not provide an index that meets their criterion of ordering degrees of dependence the way the human eye is able to.

In an upcoming release of the R Hmisc package I've added two additional outputs related to $D$, namely the mean and max $|F(x,y) - G(x)H(y)|$, which are useful measures of dependence. However, these measures, like $D$, do not have the property that the creators of MIC were seeking.
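
As a rough sketch of what such measures look like, here is one way to compute them from plain empirical CDFs evaluated at the observed points, reusing the x and y from the example above (my own construction from the definition; the exact code in Hmisc, including its tie handling, may differ):

F_xy <- sapply(seq_along(x), function(i) mean(x <= x[i] & y <= y[i]))
G_x  <- ecdf(x)(x)                  # marginal empirical CDF of x
H_y  <- ecdf(y)(y)                  # marginal empirical CDF of y
dep  <- abs(F_xy - G_x * H_y)       # pointwise departure from independence
mean(dep)                           # mean |F(x,y) - G(x)H(y)|
max(dep)                            # max  |F(x,y) - G(x)H(y)|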
