Solved – How to calculate the HHG (Heller Heller Gorfine) distance in R

While looking at this question and investigating some recent developments in measuring correlation, I came across the HHG (Heller Heller Gorfine) test. Heller et al. promote it as superior to the MIC test that made a splash in 2011. I've found their R code and a variant, but I can't see how to get a distance measure that ranges between zero and one. How can I calculate that distance?

Here's a reproducible example:

## get the code
download.file("http://www.math.tau.ac.il/~ruheller/Software/HHG2x2_0.1-1.tar.gz", "HHG2x2_0.1-1.tar.gz")
install.packages("HHG2x2_0.1-1.tar.gz", repos = NULL, type="source")
library(HHG2x2)
library(RCurl)  # provides getURL()
writeChar(getURL("https://raw.github.com/andrewdyates/HHG_R/master/R/myHHG.R", ssl.verifypeer = FALSE), con = "myHHG.R")
source("myHHG.R")

## built-in example
X = datagenCircle(50)
Dx = as.matrix(dist((X[1,]),diag=TRUE,upper=TRUE))
Dy = as.matrix(dist((X[2,]),diag=TRUE,upper=TRUE))
myHHG(Dx,Dy); pvHHG(Dx,Dy)

Which returns:

$sum_chisquared
[1] 5515.762

$sum_lr
[1] 3028.207

$max_chisquared
[1] 20.02963

$max_lr
[1] 11.30457

$pv
[1] 0.00229954

$output_monte
[1] 10001

$A_threshold
[1] 2.985682

$B_threshold
[1] -4.553877

How can I get a distance metric from this package? Or have I misunderstood what the HHG is about?

This seems to be the main source for the HHG test:

Heller, R., Y. Heller, et al. (2012). "A consistent multivariate test of association based on ranks of distances." Biometrika. 99 (4) doi:10.1093/biomet/ass070 http://arxiv.org/abs/1201.3522

Best Answer

Since the test statistic is not distribution-free, it does not quantify strength of evidence on its own. For example, given two hypothesis tests, the test statistic for the first may be larger than the test statistic for the second, yet the p-value may be smaller for the second test. There is no notion of "effect size" associated with the HHG test; the strength of evidence is quantified by the p-value.

Note that a similar issue arises with the energy test for general distributions: it is based on permutations and is not distribution-free. Therefore, although the dCor statistic lies between zero and one, if two tests both yield dCor values strictly between zero and one, the test with the smaller dCor value may actually have the smaller p-value.
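
To make the point about statistics versus p-values concrete, here is a minimal base-R sketch. It is not part of HHG2x2 or the energy package: `dcor_stat` and `perm_pvalue` are hypothetical helper names, and the simulated data (a small independent sample and a larger, weakly dependent one) are assumptions chosen only for illustration. The sketch computes the biased sample distance correlation from two distance matrices and a permutation p-value for it, so the two quantities can be compared side by side across tests.

```r
# Biased sample distance correlation computed from two distance matrices.
# Hypothetical helper, written for illustration only.
dcor_stat <- function(Dx, Dy) {
  center <- function(D) {
    # Double centering: subtract row and column means, add back the grand mean.
    sweep(sweep(D, 1, rowMeans(D)), 2, colMeans(D)) + mean(D)
  }
  A <- center(Dx)
  B <- center(Dy)
  dcov2 <- mean(A * B)                      # squared distance covariance
  denom <- sqrt(mean(A * A) * mean(B * B))  # sqrt of the product of distance variances
  sqrt(max(dcov2, 0) / denom)
}

# Permutation p-value for any statistic that takes two distance matrices.
# Rows and columns of Dy are permuted together to break the pairing with Dx.
perm_pvalue <- function(Dx, Dy, stat_fun, n_perm = 999) {
  observed <- stat_fun(Dx, Dy)
  n <- nrow(Dy)
  perm_stats <- replicate(n_perm, {
    idx <- sample(n)
    stat_fun(Dx, Dy[idx, idx])
  })
  list(statistic = observed,
       p_value = (1 + sum(perm_stats >= observed)) / (n_perm + 1))
}

set.seed(1)
# Test 1 (assumed data): independent variables, small sample.
x_small <- rnorm(10)
y_small <- rnorm(10)
# Test 2 (assumed data): weakly dependent variables, larger sample.
x_large <- rnorm(300)
y_large <- 0.3 * x_large + rnorm(300)

res_small <- perm_pvalue(as.matrix(dist(x_small)), as.matrix(dist(y_small)), dcor_stat)
res_large <- perm_pvalue(as.matrix(dist(x_large)), as.matrix(dist(y_large)), dcor_stat)

# Compare the statistic and the p-value for each test.
print(unlist(res_small))
print(unlist(res_large))
```

Because `perm_pvalue` only needs a function of two distance matrices, the same wrapper could in principle be pointed at an HHG statistic, for example `function(Dx, Dy) myHHG(Dx, Dy)$sum_chisquared`, assuming `myHHG` returns that element as in the output shown in the question. In either case the p-value, not the raw statistic, is what should be compared across tests.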
