Kendall tau-b correlation coefficient becomes more significant with many additional ties

hypothesis-testing, kendall-tau, p-value

Say I have two vectors of length N,

x = [1, 10, 12, ..., 5, 6]
y = [2, 11, 10, ..., 7, 9]

I compute the Kendall tau-b rank order correlation on these two vectors and extract a p-value. If I take the same two vectors, but add additional "null" information to the end of each,

x = [1, 10, 12, ..., 5, 6, 0, 0, 0, ..., 0]
y = [2, 11, 10, ..., 7, 9, 0, 0, 0, ..., 0]

and compute the statistic again, I get a much more significant p-value. Why is this? Since the extra zeros at the end count as ties in both vectors, I would not expect them to enter the calculation of tau-b or its variance, based on what I've read about the statistic.

A simple example using python is

import numpy
from scipy.stats import kendalltau

x = numpy.random.rand(20).tolist()
y = numpy.random.rand(20).tolist()
z = [0] * 20  # "null" padding appended to both vectors

# prints tau and p-value; the numbers vary from run to run
print(kendalltau(x, y))
# one run gave (0.042105263157894736, 0.79520761719370014)
print(kendalltau(x + z, y + z))
# the same run gave (0.69152542372881387, 3.2901769458112632e-10)

I have tested this in several languages (Python, R, MATLAB, Mathematica) and I keep getting this behavior. Can someone help me understand why these extra zeros influence the p-value so strongly?

Best Answer

Look at the formula for Kendall's $\tau$: it is the number of concordant pairs minus the number of discordant pairs, divided by the sum of the two.
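For reference, the tie-corrected variant the question asks about, tau-b, is usually written as below. The symbols are notation introduced here rather than taken from the answer: $C$ and $D$ are the concordant and discordant counts, $n$ is the number of observations, and $t_i$ and $u_j$ are the sizes of the groups of tied values in X and in Y.

```latex
\tau_b = \frac{C - D}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}, \qquad
n_0 = \frac{n(n-1)}{2}, \quad
n_1 = \sum_i \frac{t_i (t_i - 1)}{2}, \quad
n_2 = \sum_j \frac{u_j (u_j - 1)}{2}
```

When every tied pair is tied in both variables at once, as with zeros appended to both vectors, $n_1 = n_2$ and $n_0 - n_1 = C + D$, so tau-b reduces to the simple ratio $(C - D)/(C + D)$ described above.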

While ties don't count as concordant or discordant themselves, they cause you to count more concordant and discordant pairs, and if all of the ties are low values they increase concordance. A hand-calculated example will help. Let's say we start with 4 random values in each of groups X and Y. We'll sort both groups by X and then count the pairs. Once the data are sorted by one column, counting is easy: start at the top of the other column (Y) and, for each value, count every value below it that is greater as a concordant pair and every value below it that is smaller as a discordant pair. Ties are not counted. (Columns C and D below hold the concordant and discordant counts, respectively.)

X Y C D
1 2 2 1 
2 3 1 1
3 1 1 0
4 4

That has 4 concordant and 2 discordant pairs, which gives a tau of 0.33 ((4 - 2) / (4 + 2)). Now let's expand each list by adding three 0s to the front and write down the ranks instead of the values. The three 0s are tied across ranks 1 to 3, so each gets the mean rank of 2, and the next value, 1, gets rank 4.

X Y C D
2 2 4 0
2 2 4 0
2 2 4 0
4 5 2 1 
5 6 1 1
6 4 1 0
7 7

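Now I have 16 concordant pairs and 2 discordant, so tau is 14/18, or 0.78. I haven't counted the ties as concordant or discordant, but each new value, even though it adds ties, also adds to the concordant and/or discordant count for the non-tied values. When all of the tied values sit at a low rank in both lists, every pairing of a tied value with a non-tied value is concordant, so adding them always increases concordance. Since the test's p-value is driven by how far tau is from zero, this inflated tau is what makes the result look so much more significant.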
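The two hand calculations can be checked with a short brute-force count. This is a minimal sketch using only the standard library, not how any particular statistics package computes the statistic; the helper names `count_pairs`, `tied_pairs`, and `tau_b` are made up for this illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def count_pairs(x, y):
    """Return (concordant, discordant); pairs tied in x or in y count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return concordant, discordant


def tied_pairs(values):
    """Number of pairs that share the same value within one variable."""
    return sum(t * (t - 1) // 2 for t in Counter(values).values())


def tau_b(x, y):
    """Tie-corrected Kendall tau-b computed directly from the pair counts."""
    c, d = count_pairs(x, y)
    n0 = len(x) * (len(x) - 1) // 2
    return (c - d) / sqrt((n0 - tied_pairs(x)) * (n0 - tied_pairs(y)))


x = [1, 2, 3, 4]
y = [2, 3, 1, 4]
zeros = [0, 0, 0]

print(count_pairs(x, y), round(tau_b(x, y), 2))                  # (4, 2) 0.33
print(count_pairs(zeros + x, zeros + y),
      round(tau_b(zeros + x, zeros + y), 2))                     # (16, 2) 0.78

# Each zero pairs concordantly with each positive value: 3 * 4 = 12 extra pairs.
extra = count_pairs(zeros + x, zeros + y)[0] - count_pairs(x, y)[0]
print(extra == len(zeros) * len(x))                              # True
```

The same bookkeeping accounts for the numbers in the question: 20 untied values give 190 pairs, so a tau of 0.0421 corresponds to 99 concordant and 91 discordant pairs, and appending 20 zeros adds 20 * 20 = 400 concordant pairs, giving (499 - 91) / 590 = 0.6915.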