Solved – Pairwise Correlations: Multiple Comparison Correction

correlation, multiple-comparisons, statistical-significance

I have a dataset in which each sample is described by two sets of variables. I am computing pairwise correlations between each variable in the first set and every variable in the second set.

I essentially have $n$ samples. For each sample, I have two sets of variables $[a_1, \dots, a_{m_1}]$ and $[b_1, \dots, b_{m_2}]$ that define that particular sample. Correlating every variable in $a$ with every variable in $b$ gives me a final correlation matrix of size $m_1 \times m_2$. In addition, some of the variables within set $a$ may be correlated with each other, and likewise within set $b$, but I am not assessing those within-set correlations.
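For concreteness, here is a minimal R sketch of this setup; the matrices `A` and `B` are hypothetical stand-ins for the real data:

```r
set.seed(1)
n <- 50; m1 <- 4; m2 <- 6
A <- matrix(rnorm(n * m1), n, m1)   # n samples x m1 variables (set a)
B <- matrix(rnorm(n * m2), n, m2)   # n samples x m2 variables (set b)

R_ab <- cor(A, B)   # m1 x m2 matrix of pairwise correlations between the sets
dim(R_ab)           # 4 6
```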

I'm trying to figure out:

  1. if I need to correct for multiple comparisons, and
  2. if I do, what method I should use.

I have tried Bonferroni, but with the large number of comparisons I get extremely large adjusted p-values.

Best Answer

You are testing $m_1 m_2$ hypotheses of the form $H_0: \rho = 0$, so the $p$-values obtained from these tests should be adjusted. If you want to stick with controlling the family-wise error rate (FWER), then Bonferroni is the standard, if severely conservative, way to go.
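For illustration, the adjustment is a one-liner in R with the built-in `p.adjust`; the p-values below are made up:

```r
# Raw p-values from the m1 * m2 correlation tests, flattened to a vector
# (e.g. collected from cor.test on each column pair); values are illustrative.
p_raw <- c(0.0001, 0.003, 0.02, 0.2, 0.6)

p.adjust(p_raw, method = "bonferroni")   # multiplies each p by length(p_raw), capped at 1
p.adjust(p_raw, method = "holm")         # step-down FWER control, never less powerful
```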

However, as noted, you are not actually conducting $M = m_1 m_2$ independent tests. This scenario is common in genetics, where experimenters test many genes for association with a disease, but variation in genes that are physically proximal to each other is correlated. A solution was proposed by Cheverud (2001): estimate the number of "effective comparisons" $M_{eff}$, so that one can still control the ever-popular FWER without over-correcting. The method is described in this open-access publication; since you would have to wade through some genetics jargon to get there, I will give you the gist:

Given standardized data $X$ (each of the $M$ variables mean-centered and scaled to unit variance) with $n$ samples in rows, the correlation matrix is $Z = \frac{1}{n-1}X^TX$. One can obtain its eigenvalues $\lambda_i$, $i \in \{1, \dots, M\}$, via an eigendecomposition, the same computation that underlies principal components analysis (PCA). As explained in the article:

...if all variables are completely correlated, the first $\lambda$ equals the number of variables in the correlation matrix... the rest of the $\lambda$s are zero. In this case, the variance of the $\lambda$s is at its maximum, and it is equal to the number of variables in the matrix. Conversely, if no correlation exists among variables, all of the $\lambda$s will be equal to one, and the set of $\lambda$s will have no variance. Hence, the variance of the $\lambda$s will range between zero, when all the variables are independent, and $M$, where $M$ is the total number of variables included in the matrix. Therefore, the ratio of observed eigenvalue variance, $Var(\lambda_{obs})$, to its maximum ($M$) gives the proportional reduction in the number of variables in a set, and the effective number of variables ($M_{eff}$) may be calculated as follows:$$M_{eff}=1+(M-1)(1-\frac{Var(\lambda_{obs})}{M})$$

Thus, the adjusted significance threshold after Bonferroni correction would be $\alpha_{adj}=\frac{\alpha}{M_{eff,a}\,M_{eff,b}}$, where $M_{eff,a}$ and $M_{eff,b}$ are the effective numbers of variables for sets $a$ and $b$ respectively. The eigenvalues can be calculated in R with functions from the stats package that ships with every R installation: eigen applied to the correlation matrix, or princomp/prcomp, which report them as squared standard deviations (prcomp, being SVD-based, also works when a set contains more variables than samples, which princomp does not allow).
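As a minimal sketch, reusing the hypothetical `A` and `B` from the question (variables in columns), the whole calculation fits in a few lines. Note that the article's maximum variance of $M$ corresponds to the sample variance with the $M-1$ denominator, which is exactly what R's `var()` computes:

```r
# Effective number of variables for one set, following the formula above.
m_eff <- function(X) {
  M <- ncol(X)
  lambda <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  # var() divides by M - 1, so Var(lambda) is 0 when the variables are
  # independent and M when they are all perfectly correlated, as in the quote.
  1 + (M - 1) * (1 - var(lambda) / M)
}

alpha     <- 0.05
alpha_adj <- alpha / (m_eff(A) * m_eff(B))   # Bonferroni with effective counts
alpha_adj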
