Correlation – How to Determine Minimum Sample Size for Spearman’s Correlation and Kendall’s Tau b

correlation, kendall-tau, sample-size, spearman-rho

I've tried looking for scholarly articles to determine the minimum sample size required to perform either the Spearman's correlation or the Kendall's Tau b hypothesis test. However, I can't seem to find any documentation that strongly suggests a minimum sample size to use. By any chance, would any of you know what sample size is ideal?

Best Answer

For the purposes of a hypothesis test, there are two related approaches to finding an optimal sample size that are viable if you're willing to assume bivariate normality.

Power

To estimate minimal sample size at a given confidence level ($1-\alpha$) and power ($1-\beta$), we can use a modification of the equation for calculating the power of a Pearson correlation ($r$):

$$n=3+\bigg(\frac{z_{\alpha/2}+z_{\beta}}{z(r_{1})-z(r_{0})}\bigg)^2$$

Here the numerator sums the standard normal quantiles for the specified $\alpha$ and $\beta$, and the denominator is the difference between the Fisher Z transformed values of the expected ($r_1$) and null ($r_0$) correlations, with $z(r)=\tfrac{1}{2}\ln\frac{1+r}{1-r}$ (Bonett, 2016). For a null hypothesis of no correlation, $r_0 = 0$, though this need not be the case, as the formula accommodates other values for a more specific null.
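As a rough illustration, the Pearson formula above can be evaluated with the Python standard library alone. The function name `n_for_power_pearson` and its defaults (two-sided $\alpha=.05$, power $=.80$) are my own choices for this sketch, not something from the cited sources.

```python
import math
from statistics import NormalDist


def n_for_power_pearson(r1, r0=0.0, alpha=0.05, power=0.80):
    """Approximate minimum n to detect r1 against the null value r0.

    Hypothetical helper: implements n = 3 + ((z_{alpha/2} + z_beta) /
    (z(r1) - z(r0)))^2 and rounds up to the next whole observation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    effect = math.atanh(r1) - math.atanh(r0)       # atanh is the Fisher Z transform
    return math.ceil(3 + ((z_alpha + z_beta) / effect) ** 2)


print(n_for_power_pearson(0.3))
```

With these defaults, an expected correlation of $.3$ against a null of zero works out to $n = 85$.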

For the Kendall coefficient ($\tau$), we use a monotonic transform as per Fieller, Hartley, & Pearson (1957) to modify the formula slightly and solve for n:

$$n=4+.437\bigg(\frac{z_{\alpha/2}+z_{\beta}}{z(\tau_{b1})-z(\tau_{b0})}\bigg)^2$$

For the Spearman coefficient ($\rho$), following the transform in Bonett & Wright (2000), the formula is:

$$n=3+\bigg(1+\frac{\rho^2_{s1}}{2}\bigg)\bigg(\frac{z_{\alpha/2}+z_{\beta}}{z(\rho_{s1})-z(\rho_{s0})}\bigg)^2$$

The minimum sample size therefore depends on the value of $\tau$ or $\rho$ you expect to observe, the null value, and the chosen confidence level and power. More details on power can be found in Looney (2018).
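The two rank-based formulas differ from the Pearson case only in the additive constant and a variance multiplier, so they can be sketched together. The function name `n_for_power_rank` and the `method` argument are assumptions made for this example; the same bivariate normality caveat applies.

```python
import math
from statistics import NormalDist


def n_for_power_rank(coef1, coef0=0.0, method="spearman", alpha=0.05, power=0.80):
    """Approximate minimum n for a Kendall tau-b or Spearman rho test.

    Hypothetical helper: coef1 is the expected coefficient, coef0 the null value.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    ratio_sq = ((z_alpha + z_beta) / (math.atanh(coef1) - math.atanh(coef0))) ** 2
    if method == "kendall":
        n = 4 + 0.437 * ratio_sq                 # Fieller, Hartley & Pearson (1957)
    elif method == "spearman":
        n = 3 + (1 + coef1 ** 2 / 2) * ratio_sq  # Bonett & Wright (2000)
    else:
        raise ValueError("method must be 'kendall' or 'spearman'")
    return math.ceil(n)


print(n_for_power_rank(0.3, method="spearman"))
print(n_for_power_rank(0.3, method="kendall"))
```

For an expected coefficient of $.3$ this gives $n = 89$ for Spearman and $n = 40$ for Kendall. The two numbers are not directly comparable, because $\tau = .3$ and $\rho = .3$ describe different strengths of association.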

Precision

Say you have a reasonable idea what range of values your $\tau$ or $\rho$ 'should' be in. In this case, you may want to estimate the sample size needed to achieve a particular confidence interval (CI) width for precision (e.g. $\rho=.3\pm.1$, a total width of $w=.2$). Provided the resulting CI does not include zero, you've effectively ensured sufficient power for a standard null hypothesis significance test as well.

Bonett & Wright (2000) established a two-step method to achieve this, again using a monotonic transform assuming bivariate normality. They provide a handy table of minimum sample sizes required across a range of different correlation values, CI widths, and alphas on page 26. But I'll outline the rough approach below.

First, calculate an initial, imprecise approximation of the required sample size ($n_0$) at a given estimate of the Pearson, Spearman, or Kendall correlation (denoted more generally below as $\hat{r}$) and a desired confidence interval width ($w$). This is given by:

$$n_0 = b+4c^2 (1-\hat{r}^2)^2 \bigg( \frac{z_{\alpha/2}}{w}\bigg)^2$$

Where

  • $b$ is equal to 3 for the Pearson and Spearman correlations and 4 for the Kendall; and
  • $c$ is equal to $1$ for the Pearson, $\sqrt{1+\hat{r}^2/2}$ for the Spearman, and $\sqrt{.437}$ for the Kendall.
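A minimal sketch of this first step, assuming $w$ is the full width of the interval. The names `constants` and `initial_n` are hypothetical. For Spearman the code uses $c^2 = 1+\hat{r}^2/2$, which matches the multiplier in the Spearman power formula earlier in this answer.

```python
import math
from statistics import NormalDist


def constants(r_hat, method):
    """Return (b, c) for the chosen coefficient (hypothetical helper)."""
    if method == "pearson":
        return 3, 1.0
    if method == "spearman":
        return 3, math.sqrt(1 + r_hat ** 2 / 2)
    if method == "kendall":
        return 4, math.sqrt(0.437)
    raise ValueError("method must be 'pearson', 'spearman' or 'kendall'")


def initial_n(r_hat, width, method="pearson", alpha=0.05):
    """First-pass sample size n0 for a CI of total width `width`."""
    b, c = constants(r_hat, method)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return b + 4 * c ** 2 * (1 - r_hat ** 2) ** 2 * (z / width) ** 2


print(round(initial_n(0.3, 0.2), 2))
```

The value is left unrounded here because it only feeds the second step.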

Second, use a Fisher Z type transform to find the confidence limits of $\hat{r}$ given the initial sample size approximation $n_0$:

$$\text{Lower Limit}=\frac{[\exp(2L_1)-1]}{[\exp(2L_1)+1]}$$

$$\text{Upper Limit}=\frac{[\exp(2L_2)-1]}{[\exp(2L_2)+1]}$$

Where

$$L_1=.5[\ln(1+\hat{r})-\ln(1-\hat{r})]-\frac{c(z_{\alpha/2})}{\sqrt{n_0-b}}$$

$$L_2=.5[\ln(1+\hat{r})-\ln(1-\hat{r})]+\frac{c(z_{\alpha/2})}{\sqrt{n_0-b}}$$
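The limits above are just the hyperbolic tangent of $L_1$ and $L_2$, since $\tanh(L) = [\exp(2L)-1]/[\exp(2L)+1]$, and the first term of each $L$ is $\operatorname{atanh}(\hat{r})$. A short sketch follows; the function name `fisher_ci` and the explicit `b` and `c` arguments are my own choices for illustration.

```python
import math
from statistics import NormalDist


def fisher_ci(r_hat, n, b=3, c=1.0, alpha=0.05):
    """Confidence limits for r_hat at sample size n (hypothetical helper).

    b and c are the coefficient-specific constants; the defaults are the
    Pearson values.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = math.atanh(r_hat)             # .5[ln(1 + r) - ln(1 - r)]
    half_width = c * z / math.sqrt(n - b)  # on the transformed scale
    lower = math.tanh(centre - half_width)  # back-transform L1
    upper = math.tanh(centre + half_width)  # back-transform L2
    return lower, upper


lo, hi = fisher_ci(0.3, 100)
print(round(lo, 3), round(hi, 3))
```

Note that the interval is symmetric on the transformed scale but not around $\hat{r}$ itself once back-transformed.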

Now subtract the lower limit from the upper limit to get the width ($w_0$) of the confidence interval for $\hat{r}$ implied by the initial sample size estimate $n_0$. This $n_0$ will typically not be exactly correct, but comparing $w_0$ with the desired width $w$ gives a better estimate:

$$n = (n_0-b)\bigg(\frac{w_0}{w}\bigg)^2+b$$

Note however that this estimate may perform poorly at the upper ends of the correlation (say, $\hat{r}>.8$). In this case, a more conservative estimate is given in Doug Bonett's lecture notes as:

$$n = n_0\bigg(\frac{w_0^2}{w^2}\bigg)$$

In practice, I've found this gives approximately the same values at the lower ends of possible $\hat{r}$ values too, so maybe just default to that.
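Putting the two steps together, here is a self-contained sketch of the whole procedure. The function name `two_step_n` and the `conservative` flag are hypothetical, $w$ is treated as the full interval width, and $n_0$ is left unrounded before the adjustment (rounding it up first would be an equally defensible choice). For Spearman it uses $c^2 = 1+\hat{r}^2/2$, matching the multiplier in the power formula earlier in this answer.

```python
import math
from statistics import NormalDist


def two_step_n(r_hat, width, method="pearson", alpha=0.05, conservative=False):
    """Two-step sample size for a CI of total width `width` around r_hat."""
    b, c = {
        "pearson": (3, 1.0),
        "spearman": (3, math.sqrt(1 + r_hat ** 2 / 2)),
        "kendall": (4, math.sqrt(0.437)),
    }[method]
    z = NormalDist().inv_cdf(1 - alpha / 2)

    # Step 1: first-pass approximation n0.
    n0 = b + 4 * c ** 2 * (1 - r_hat ** 2) ** 2 * (z / width) ** 2

    # Step 2: width w0 of the Fisher-type interval implied by n0.
    centre = math.atanh(r_hat)
    half = c * z / math.sqrt(n0 - b)
    w0 = math.tanh(centre + half) - math.tanh(centre - half)

    # Adjust n0 by the squared ratio of achieved to desired width.
    if conservative:
        n = n0 * (w0 / width) ** 2
    else:
        n = (n0 - b) * (w0 / width) ** 2 + b
    return math.ceil(n)


for method in ("pearson", "spearman", "kendall"):
    print(method, two_step_n(0.3, 0.2, method), two_step_n(0.3, 0.2, method, conservative=True))
```

Comparing the two columns of output for your own $\hat{r}$ and $w$ is a quick way to see whether the choice of adjustment matters in your case.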

Problems

As stated a couple of times here, the critical assumption behind all of these theoretical estimates is that the two variables are bivariate normal. This may be problematic. Researchers frequently want to use $\tau$ and $\rho$ precisely because their variables are not normally distributed (they are skewed, ordinal, etc.). The extent to which these methods will work in those circumstances is uncertain. Simulating data that resemble your own to empirically estimate power or precision across a range of candidate sample sizes may be a reasonable workaround, but I'm not sure.
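To make the simulation idea concrete, here is one possible sketch using only the standard library. Everything about the data-generating model is an assumption chosen for illustration: two 5-point ordinal items obtained by cutting correlated latent normal variables at fixed thresholds, and a large-sample normal approximation ($r_s\sqrt{n-1}$ compared with $z_{\alpha/2}$) in place of an exact or permutation test. Substitute a generator that resembles your own data.

```python
import math
import random
from statistics import NormalDist


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def to_likert(v, cuts=(-1.5, -0.5, 0.5, 1.5)):
    """Map a latent normal score to a 1-5 ordinal category (assumed cut points)."""
    return sum(v > c for c in cuts) + 1


def simulated_power(n, latent_r=0.5, alpha=0.05, reps=500, seed=1):
    """Share of simulated samples of size n in which the null is rejected."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(n):
            a = rng.gauss(0, 1)
            b = latent_r * a + math.sqrt(1 - latent_r ** 2) * rng.gauss(0, 1)
            xs.append(to_likert(a))
            ys.append(to_likert(b))
        # Large-sample approximation: r_s * sqrt(n - 1) is roughly N(0, 1) under the null.
        if abs(spearman(xs, ys)) * math.sqrt(n - 1) > crit:
            hits += 1
    return hits / reps


for n in (20, 40, 80):
    print(n, simulated_power(n))
```

Scanning a grid of sample sizes and picking the smallest one whose simulated power clears your target gives an empirical counterpart to the formulas above. The answer is only as good as the assumed generator and test.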


References

Bonett, D. G. (2016). Sample Size Planning for Behavioral Science Research. Retrieved from http://people.ucsc.edu/~dgbonett/sample.html.

Bonett, D. G., & Wright, T. A. (2000). Sample size requirements for estimating Pearson, Kendall and Spearman correlations. Psychometrika, 65(1), 23-28.

Fieller, E. C., Hartley, H. O., & Pearson, E. S. (1957). Tests for rank correlation coefficients. I. Biometrika, 44(3/4), 470-481.

Looney, S. W. (2018). Practical Issues in Sample Size Determination for Correlation Coefficient Inference. SM Journal of Biometrics & Biostatistics, 3(1), 1027.
