What method can I use to test whether there is a correlation between two sets of data? The correlation coefficient works when the association is linear, but if I have two sets that are clearly (visually, from a graph) correlated in a non-linear way, how can I test that? Is there a coefficient or a special method?
Solved – Correlation coefficient for sets with non-linear correlation
Tags: correlation, nonlinear, nonparametric
Related Solutions
Is it not telling that this was published in a non-statistical journal whose statistical peer review we are unsure of? This problem was solved by Hoeffding in 1948 (Annals of Mathematical Statistics 19:546), who developed a straightforward algorithm requiring no binning and no multiple steps. Hoeffding's work was not even referenced in the Science article. It has been available for many years as the hoeffd function in the R Hmisc package. Here's an example (type example(hoeffd) in R):
# Hoeffding's test can detect even one-to-many dependency
set.seed(1)
x <- seq(-10, 10, length = 200)
y <- x * sign(runif(200, -1, 1))
plot(x, y)     # an "X" pattern
hoeffd(x, y)   # also accepts a numeric matrix

D
     x    y
x 1.00 0.06
y 0.06 1.00

n= 200

P
  x y
x   0    # P-value is very small
y 0
hoeffd uses a fairly efficient Fortran implementation of Hoeffding's method. The basic idea of his test is to consider the difference between the joint ranks of X and Y and the product of the marginal rank of X and the marginal rank of Y, suitably scaled.
Update
I have since been corresponding with the authors (who are very nice by the way, and are open to other ideas and are continuing to research their methods). They originally had the Hoeffding reference in their manuscript but cut it (with regrets, now) for lack of space. While Hoeffding's $D$ test seems to perform well for detecting dependence in their examples, it does not provide an index that meets their criteria of ordering degrees of dependence the way the human eye is able to.
In an upcoming release of the R Hmisc package I've added two additional outputs related to $D$, namely the mean and max of $|F(x,y) - G(x)H(y)|$, which are useful measures of dependence. However, these measures, like $D$, do not have the property that the creators of MIC were seeking.
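As a rough sketch (not the Hmisc implementation), those two quantities can be approximated directly from empirical CDFs evaluated at the observed points; the data below reuse the "X"-shaped example from earlier:

```r
# Rough sketch (not the Hmisc implementation): empirical versions of the
# mean and max of |F(x,y) - G(x)H(y)|, evaluated at the observed points.
set.seed(1)
x <- seq(-10, 10, length = 200)
y <- x * sign(runif(200, -1, 1))

n   <- length(x)
Fxy <- sapply(seq_len(n), function(i) mean(x <= x[i] & y <= y[i]))  # joint ECDF
Gx  <- ecdf(x)(x)   # marginal ECDF of x at the data points
Hy  <- ecdf(y)(y)   # marginal ECDF of y at the data points

d <- abs(Fxy - Gx * Hy)
mean(d)   # average departure from independence
max(d)    # worst-case departure
```

Under independence both quantities stay near zero; strong dependence, even the one-to-many kind in this example, pushes them away from zero.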
You can, for example, generate data from a bivariate normal distribution, where the off-diagonal entry of the variance-covariance matrix is the covariance. In R, this can readily be done with rmvnorm from the mvtnorm package.
Example Generate $1000$ realisations from $X=(X_{1}, X_{2})' \sim N(\mu, \Sigma)$ with $$\mu = (-1, 5)', \quad \Sigma_{11} = V(X_{1}) = 0.7, \quad \Sigma_{22}= V(X_{2}) = 0.1$$ and $\Sigma_{12} = \Sigma_{21} = \textrm{Cov}(X_1, X_2)$ such that $\textrm{Cor}(X_{1}, X_{2})=0.85$.
> #------load the package------
> library(mvtnorm)
> #----------------------------
>
> #------compute the covariance such that cor(X1, X2) = 0.85------
> covariance <- 0.85 * sqrt(0.7) * sqrt(0.1)
> #---------------------------------------------------------------
>
> #------variance-covariance matrix------
> sigma <- matrix(c(0.7, covariance, covariance, 0.1), nrow=2, byrow=TRUE)
> sigma
          [,1]      [,2]
[1,] 0.7000000 0.2248889
[2,] 0.2248889 0.1000000
> #--------------------------------------
>
> #------data generation------
> test <- rmvnorm(n=1000, mean=c(-1, 5), sigma=sigma)
> #---------------------------
>
> #------compute the empirical correlation on this particular data------
> cor(test[, 1], test[, 2])
[1] 0.8478849
> #---------------------------------------------------------------------
NB: You can also generate data according to a linear regression model: $X_2 = a + bX_1 + \epsilon$.
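A sketch of that regression route, with parameters chosen (as an assumption, to match the bivariate-normal example above) so that $V(X_1)=0.7$, $V(X_2)=0.1$, and $\textrm{Cor}(X_1,X_2)=0.85$:

```r
# Sketch: generate correlated data via X2 = a + b*X1 + eps.
# Parameters are chosen to match the example above (an assumption):
# V(X1) = 0.7, V(X2) = 0.1, Cor(X1, X2) = 0.85, means (-1, 5).
set.seed(1)
rho <- 0.85
v1  <- 0.7
v2  <- 0.1
b   <- rho * sqrt(v2 / v1)   # slope: Cov(X1, X2) / V(X1)
a   <- 5 - b * (-1)          # intercept so that E[X2] = 5
x1  <- rnorm(1000, mean = -1, sd = sqrt(v1))
x2  <- a + b * x1 + rnorm(1000, sd = sqrt(v2 * (1 - rho^2)))
cor(x1, x2)   # close to 0.85
```

The slope and error variance follow from $\textrm{Cor}(X_1,X_2) = b\,\sigma_1 / \sqrt{b^2\sigma_1^2 + \sigma_\epsilon^2}$, so the two recipes target the same joint distribution.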
Best Answer
In the case of a non-linear but monotonic association, Spearman's rank correlation is one option; another is Kendall's tau. Both are rank-based, so they detect monotonic dependence rather than arbitrary non-linear dependence. In R, both are available via cor(x, y, method = "spearman") and cor(x, y, method = "kendall").
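A minimal sketch of both in base R, using cor and cor.test; the toy data here (an exponential trend with noise, an illustrative assumption) are monotone but clearly non-linear:

```r
# Toy data: monotone but non-linear relationship (illustrative assumption)
set.seed(1)
x <- runif(100, 0, 5)
y <- exp(x) + rnorm(100)   # y increases non-linearly with x

# Spearman's rank correlation
cor(x, y, method = "spearman")        # coefficient only
cor.test(x, y, method = "spearman")   # coefficient plus a p-value

# Kendall's tau
cor(x, y, method = "kendall")
cor.test(x, y, method = "kendall")
```

Because both coefficients depend only on ranks, they are invariant to any monotone transformation of x or y, which is exactly why they handle this kind of non-linearity while Pearson's coefficient understates it.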