Is a correlation analysis with Pearson's correlation and the Bonferroni method a valid approach to finding correlations between two sets of data?

correlation

I'm studying biomedical computer science and I have to review a paper on genotype-phenotype association.

In this paper the authors perform a correlation analysis: they first calculate the Pearson correlation and then use the hypergeometric distribution to filter out insignificant associations.

http://www.biomedcentral.com/1471-2164/7/257
Under Methods/Associating genes to phenotypes

While the correlation measures the strength of association between an organism's genomic content and its phenotype, we also applied another method, exploiting the hypergeometric distribution function, to determine the significance of these associations […] where a result smaller or equal to 20% response is considered negative. So for a given gene found in M species, the hypergeometric function provides the probability by random chance that the gene is found in m species which contain the COG and are also positive in the laboratory test.
The following criteria were applied to the correlated data set. The intersection between a specific COG and a phenotype had to contain at least 3 organisms, and for any intersection, 30% of the microbes had to share the COG. The scores were adjusted using the standard Bonferroni error correction for multiple testing.
Since the Bonferroni correction is one of the most conservative, it is likely that some biologically relevant associations were unnecessarily discarded. In this case $\alpha$ was set as less than equal to 0.01, therefore, any hypergeometric distribution score less than or equal to 0.0001 was deemed significant. Using these criteria, we set a 0.8 and a 0.9 correlation threshold to assess the significance of the COG-phenotype associations.
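For reference, here is how I read the hypergeometric step, written out as a formula. The symbols $N$ (total number of species) and $K$ (number of species positive for the phenotype) are my own assumptions, since the excerpt only names $M$ and $m$:

$$P(X = m) = \frac{\binom{K}{m}\binom{N-K}{M-m}}{\binom{N}{M}}, \qquad P(X \ge m) = \sum_{k=m}^{\min(K,M)} \frac{\binom{K}{k}\binom{N-K}{M-k}}{\binom{N}{M}}$$

The Bonferroni correction compares each p-value with $\alpha / n$, where $n$ is the number of tests. If the 0.0001 cutoff is exactly $\alpha / n$ with $\alpha = 0.01$, that would correspond to $n = 100$ tests. The excerpt does not state the number of tests, so this is only an inference.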

My question is: Is this a valid scientific correlation analysis or not? Are there any reservations?

Also, can you give me an idea for a good statistics book for science?

Best Answer

If you want to test whether a given correlation coefficient is significantly different from 0, you would use the distribution of the sample Pearson product-moment correlation under the null hypothesis. What they are asking here is different. In this specific case they use the hypergeometric distribution because, if there is really no correlation, they want to know the chance that the gene will occur in m out of the M species, for each m between 0 and M. That situation is described by the hypergeometric distribution. So if m is sufficiently large, you would infer that the distribution is not hypergeometric and consequently that there is a real correlation. This seems to be an alternative test for nonzero correlation.

It is often possible to have several tests for the same null hypothesis, in which case you would pick the one that is most powerful under reasonable assumptions for your data. It is not clear to me whether this hypergeometric test has good power characteristics.
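To make the two quantities concrete, here is a minimal Python sketch that computes a Pearson correlation and a hypergeometric upper-tail probability on the same presence/absence data. It is only an illustration under stated assumptions, not the paper's actual pipeline: the function names, the upper-tail reading of the test, and the ten-species data set are all made up for this example.

```python
from math import comb, sqrt


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def hypergeom_upper_tail(n_species, n_positive, n_with_gene, n_overlap):
    """P(X >= n_overlap) when n_with_gene species are drawn at random,
    without replacement, from n_species of which n_positive are
    phenotype-positive (assumed reading of the paper's test)."""
    total = comb(n_species, n_with_gene)
    upper = min(n_positive, n_with_gene)
    hits = sum(
        comb(n_positive, k) * comb(n_species - n_positive, n_with_gene - k)
        for k in range(n_overlap, upper + 1)
    )
    return hits / total


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff under the Bonferroni correction."""
    return alpha / n_tests


# Hypothetical data: 1 = gene present / phenotype positive, per species.
gene = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
phenotype = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

overlap = sum(g * t for g, t in zip(gene, phenotype))
r = pearson_r(gene, phenotype)
p = hypergeom_upper_tail(len(gene), sum(phenotype), sum(gene), overlap)

# n_tests=100 is an arbitrary illustrative number of gene-phenotype pairs.
cutoff = bonferroni_threshold(0.01, n_tests=100)
print(f"r = {r:.3f}, p = {p:.4f}, significant = {p <= cutoff}")
```

With so few species, a fairly high correlation can still fail a strict Bonferroni cutoff, which is why the two criteria are not interchangeable.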

Regarding good statistics textbooks, I don't know of any designed for science in general. If you want a good engineering statistics text or a medical statistics text, I can make recommendations. I have also reviewed over 600 books on Amazon, so if you shop around for books there, there is a good chance that you will find a review by me for some of them. For engineering I would recommend looking for a book by Douglas Montgomery or one by Jay Devore. For medicine, look at Riffenburgh's "Statistics in Medicine" or the book by Altman. I have also written my own text, "The Essentials of Biostatistics for Physicians, Nurses and Clinicians". For general statistics, "The Practice of Statistics" by David Moore is an excellent introductory text.