Solved – How to get a $p$-value from the Cochran-Armitage trend test

association-measure, chi-squared-test, genetics, p-value

So, I'm working with GWAS SNP data and want to perform several tests for association between genotype and phenotype. There are two phenotypes (case and control) and two or three genotypes. Most of the tests are chi-squared tests on different contingency tables ($2 \times 2$ or $2 \times 3$); one of them is the Cochran-Armitage trend test (CATT).

Once I have constructed the contingency table, I can easily get a $p$-value for the chi-squared tests using the Apache Commons Math library. No problem.

However, the explanation of the CATT on Wikipedia is not sufficient for me to implement it (my statistics knowledge is limited and I'm still learning).

As in the Wikipedia example, I suspect a linear trend, so my weights are $t = (0,1,2)$, which reduces the formula for $T$ to:
$$
T \equiv (N_{12}R_2 - N_{22}R_1) + 2(N_{13}R_2 - N_{23}R_1)
$$
and the one for the variance to:
$$
Var(T) = {{R_1 R_2} \over N} ( N(C_2+4C_3) - (C_2 + 2C_3)^2)
$$

I checked how the program PLINK does it, since it's already implemented there, but it differs slightly from the above formulas. The C++ source code there would correspond to this:
$$
T = {(N_{12}R_2 - N_{22}R_1) + 2(N_{13}R_2 - N_{23}R_1)\over N}
$$
and
$$
Var(T) = {{R_1 R_2} \over N} {( N(C_2+4C_3) - (C_2 + 2C_3)^2) \over N^2}
$$

Then it calculates a chi-squared value like this:
$$
\chi^2_{T} = {T^2 \over Var(T)}
$$
and calculates the $p$-value as for any other chi-squared value with $df = 1$.

I don't need to understand the theory completely, as long as my program calculates correctly, but understanding it would give me additional confidence.

Is this correct or legitimate? Is this how I get the $p$-value?

Best Answer

This is just a different definition of the statistic $T$. Call your statistic $T_1$ and PLINK's $T_2$. Note that $T_2 = T_1/N$, which is why the variance of $T_2$ differs from that of $T_1$ by a factor of $1/N^2$. The chi-squared statistic, however, is the same in either case: for $T_2$, the factor of $1/N^2$ appears in both the numerator and the denominator and cancels,
$$
{T_2^2 \over Var(T_2)} = {T_1^2 / N^2 \over Var(T_1) / N^2} = {T_1^2 \over Var(T_1)}
$$
so you get the same test statistic, and therefore the same $p$-value, either way.
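To check the equivalence numerically, here is a minimal Python sketch. It assumes the general weighted form of the statistic and variance given on the Wikipedia page for the trend test, and it is not PLINK's actual code. The function name `catt`, the table layout (cases in the first row, controls in the second) and the counts in the usage example are made up for illustration. For $df = 1$ the chi-squared upper tail equals $\mathrm{erfc}(\sqrt{\chi^2/2})$, so the standard library's `math.erfc` stands in here for the chi-squared distribution you would otherwise take from a statistics library.

```python
import math


def catt(table, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on a 2 x k table.

    table   -- (cases, controls), each a sequence of k genotype counts
               (illustrative layout, not taken from PLINK)
    weights -- trend weights t_i; (0, 1, 2) encodes a linear trend
    Returns (chi2 from T_1, chi2 from T_2 = T_1 / N, p-value).
    """
    cases, controls = table
    r1, r2 = sum(cases), sum(controls)                  # row totals R_1, R_2
    cols = [a + b for a, b in zip(cases, controls)]     # column totals C_i
    n = r1 + r2
    k = len(cols)

    # T_1 = sum_i t_i * (N_1i * R_2 - N_2i * R_1)
    t1 = sum(t * (a * r2 - b * r1)
             for t, a, b in zip(weights, cases, controls))

    # Var(T_1) = R_1 R_2 / N * ( sum_i t_i^2 C_i (N - C_i)
    #                            - 2 sum_{i<j} t_i t_j C_i C_j )
    s = sum(t * t * c * (n - c) for t, c in zip(weights, cols))
    s -= 2 * sum(weights[i] * weights[j] * cols[i] * cols[j]
                 for i in range(k) for j in range(i + 1, k))
    var1 = r1 * r2 / n * s

    # Rescaled definition: statistic divided by N, variance by N^2
    t2 = t1 / n
    var2 = var1 / n ** 2

    chi2_1 = t1 ** 2 / var1
    chi2_2 = t2 ** 2 / var2

    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2_1 / 2))
    return chi2_1, chi2_2, p_value


# Made-up counts, for illustration only
chi2_a, chi2_b, p = catt(((10, 40, 50), (30, 50, 20)))
print(chi2_a, chi2_b, p)
```

Both chi-squared values agree up to floating-point rounding, so either definition of $T$ leads to the same $p$-value. The sketch does not guard against a zero variance (for example, when all observations fall in a single genotype column), which a real implementation would need to handle.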