Spearman’s Rho – Understanding the Parameter

Tags: correlation, hypothesis testing, multivariate analysis, nonparametric, statistical significance

For Kendall's $\tau$ the parameter of interest is

$$E[\text{sign}(X_1-X_2)\text{sign}(Y_1-Y_2)]$$

where $(X_1,Y_1),(X_2,Y_2)$ are iid copies of $(X,Y)$.

The natural estimator of this parameter is the familiar sample Kendall's tau.

But what about Spearman's $\rho$? I couldn't find any reliable reference for the parameter that this estimator targets. This matters because one can compute p-values and confidence intervals, but… for what parameter?

The estimator defined by Spearman is intuitive and easy to interpret once calculated, but what can I conclude from a significance test if no theoretical parameter is involved?

Any help or reference will be appreciated!

Best Answer

Suppose $(X_1,Y_1),(X_2,Y_2),\ldots,(X_n,Y_n)$ are i.i.d. random vectors with a continuous distribution. Let $R_i =\operatorname{Rank}(X_i)$ among $X_1,X_2,\ldots,X_n$ and $Q_i=\operatorname{Rank}(Y_i)$ among $Y_1,Y_2,\ldots,Y_n$, $\,i=1,2,\ldots,n$.

Spearman's rank correlation coefficient is then the sample quantity

$$r_S=\frac{\sum_{i=1}^n \left(R_i-\frac{n+1}2 \right)\left(Q_i-\frac{n+1}2 \right)}{\sqrt{\sum_{i=1}^n \left(R_i-\frac{n+1}2 \right)^2}\sqrt{\sum_{i=1}^n \left(Q_i-\frac{n+1}2\right)^2}}$$
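As a quick numerical sanity check (a sketch with simulated data, not part of the original answer), the rank formula above can be computed directly and compared with `scipy.stats.spearmanr`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)     # a correlated pair (X, Y)

R = stats.rankdata(x)                # ranks R_i among X_1, ..., X_n
Q = stats.rankdata(y)                # ranks Q_i among Y_1, ..., Y_n
c = (n + 1) / 2                      # the mean rank (n+1)/2

r_s = np.sum((R - c) * (Q - c)) / np.sqrt(
    np.sum((R - c) ** 2) * np.sum((Q - c) ** 2)
)
print(r_s, stats.spearmanr(x, y)[0])  # the two values agree
```

Since the data are continuous there are no ties, and the formula reduces to the Pearson correlation of the rank vectors, which is exactly what `spearmanr` computes.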

It can be shown that

$$E(r_S)\to \rho_G \quad\text{ as }n\to \infty\,, \tag{$\star$}$$

where $\rho_G$ is the grade correlation coefficient defined as

$$\rho_G=\operatorname{Corr}(F(X_1),G(Y_1))$$

Here $F$ and $G$ are the distribution functions of $X$ and $Y$ respectively.

So $r_S$ is an asymptotically unbiased estimator of $\rho_G$, and at least in this sense $\rho_G$ is a parameter of interest and can be considered to be a population counterpart of $r_S$.

On the other hand, the statistic $$T_n=\frac1{\binom{n}{2}}\sum_{1\le i<j\le n}\operatorname{sgn}(X_i-X_j)\operatorname{sgn}(Y_i-Y_j)$$ is exactly unbiased for its population counterpart, Kendall's tau:

$$\tau=E\left[\operatorname{sgn}(X_1-X_2)\operatorname{sgn}(Y_1-Y_2)\right]$$
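A brute-force computation of $T_n$ over all pairs (again a sketch with simulated, tie-free data) matches `scipy.stats.kendalltau`, since with a continuous distribution there are no ties and scipy's tau-b coincides with the simple average of sign products:

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# T_n: average of sgn(X_i - X_j) * sgn(Y_i - Y_j) over the C(n, 2) pairs i < j
T_n = np.mean([
    np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    for i, j in combinations(range(n), 2)
])
print(T_n, stats.kendalltau(x, y)[0])  # identical for tie-free data
```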

If you note that

$$\sum_{j:j\ne i}\operatorname{sgn}(X_i-X_j)=(R_i-1)-(n-R_i)=2\left(R_i-\frac{n+1}2\right)$$

and similarly

$$\sum_{j:j\ne i}\operatorname{sgn}(Y_i-Y_j)=2\left(Q_i-\frac{n+1}2\right)\,,$$

we have this relation between $r_S$ and $T_n$:

\begin{align} r_S&=\frac{12}{n(n^2-1)}\sum_{i=1}^n \left(R_i-\frac{n+1}2\right)\left(Q_i-\frac{n+1}2\right) \\&=\frac3{n(n^2-1)}\sum_{i=1}^n \left\{\sum_{j\ne i}\operatorname{sgn}(X_i-X_j)\right\}\left\{\sum_{k\ne i}\operatorname{sgn}(Y_i-Y_k)\right\} \\&=\frac3{n+1}T_n+\frac{3(n-2)}{n+1}U_n\,, \tag{1} \end{align}

where $$U_n=\frac1{n(n-1)(n-2)}\sum_{i\ne j\ne k}\operatorname{sgn}(X_i-X_j)\operatorname{sgn}(Y_i-Y_k)$$
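Identity $(1)$ is purely algebraic, so it can be verified exactly on any tie-free sample. The following sketch computes $T_n$ and $U_n$ by brute force (the hypothetical helper arrays `sx`, `sy` hold the sign matrices $\operatorname{sgn}(X_i-X_j)$ and $\operatorname{sgn}(Y_i-Y_j)$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 25
x = rng.normal(size=n)
y = x + rng.normal(size=n)

r_s = stats.spearmanr(x, y)[0]

sx = np.sign(x[:, None] - x[None, :])   # sx[i, j] = sgn(X_i - X_j), zero diagonal
sy = np.sign(y[:, None] - y[None, :])

# T_n: each unordered pair appears twice in the full matrix, so divide by n(n-1)
T_n = np.sum(sx * sy) / (n * (n - 1))

# U_n: ordered triples (i, j, k) with i, j, k all distinct.
# Row sums give sum over j != i and k != i (including j == k);
# subtracting the j == k terms leaves exactly the distinct triples.
row_x = sx.sum(axis=1)
row_y = sy.sum(axis=1)
U_n = (np.sum(row_x * row_y) - np.sum(sx * sy)) / (n * (n - 1) * (n - 2))

rhs = 3 / (n + 1) * T_n + 3 * (n - 2) / (n + 1) * U_n
print(r_s, rhs)  # equal up to floating-point error
```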

Using the independence of $X_2$ and $Y_3$, we can write

\begin{align} E(U_n)&=E\left[\operatorname{sgn}(X_1-X_2)\operatorname{sgn}(Y_1-Y_3)\right] \\&=E \left[ E\left[\operatorname{sgn}(X_1-X_2)\operatorname{sgn}(Y_1-Y_3)\mid X_1,Y_1 \right]\right] \\&=E \left[ E\left[\operatorname{sgn}(X_1-X_2)\mid X_1 \right] E\left[\operatorname{sgn}(Y_1-Y_3) \mid Y_1\right] \,\right] \\&=E\left[\bigl(F(X_1)-(1-F(X_1))\bigr) \bigl(G(Y_1)-(1-G(Y_1))\bigr)\right] \\&=4\, E\left[\left(F(X_1)-\frac12\right)\left(G(Y_1)-\frac12\right)\right] \\&=\frac13 \rho_G \tag{2} \end{align}

Equations $(1)$ and $(2)$ then together imply $(\star)$.
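For a concrete illustration of $(\star)$, one can simulate from a distribution where $\rho_G$ has a closed form. For a bivariate normal with correlation $\rho$, the classical result is $\rho_G = \frac{6}{\pi}\arcsin\frac{\rho}{2}$; the sketch below (with an assumed $\rho=0.6$ and a large sample so that $r_S$ is close to its limit) checks this numerically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rho = 0.6
n = 100_000

# Bivariate normal sample with correlation rho
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

rho_G = 6 / np.pi * np.arcsin(rho / 2)   # grade correlation, closed form
r_s = stats.spearmanr(x, y)[0]           # sample Spearman coefficient
print(r_s, rho_G)                        # close for large n, per (star)
```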

Typically we are interested in testing the null hypothesis $$H_0: X \text{ and }Y \text{ are independently distributed}$$

Under $H_0$, we have $\rho_G=0$ as well as $\tau=0$, which implies $E_{H_0}(r_S)=0$. The variance under $H_0$ can be shown to be $\operatorname{Var}_{H_0}(r_S)=\frac1{n-1}$. A large-sample test is then based on

$$\sqrt{n-1}\,r_S \stackrel{d}\longrightarrow N(0,1)\quad \text{under }H_0$$
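The large-sample test is easy to carry out by hand (a sketch with independent simulated data, so $H_0$ holds; note scipy's own `spearmanr` p-value uses a slightly different approximation, so the two need not match exactly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
y = rng.normal(size=n)                 # independent of x, so H0 is true

r_s = stats.spearmanr(x, y)[0]
z = np.sqrt(n - 1) * r_s               # approximately N(0, 1) under H0
p_value = 2 * stats.norm.sf(abs(z))    # two-sided normal p-value
print(z, p_value)
```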

Note, however, that this is not a test of $\rho_G=0$, and it does not give confidence intervals for $\rho_G$ or $E(r_S)$, since the asymptotic distribution of $r_S$ is derived only under $H_0$.

Reference:

  • Nonparametric Statistical Inference (5th ed.) by Gibbons and Chakraborti, pages 416-421.

  • Nonparametric Statistical Methods (3rd ed.) by Hollander, Wolfe, and Chicken, pages 427-440.

  • Statistical Inference Based on Ranks by T.P. Hettmansperger.

Related question: Spearman's correlation as a parameter.
