Solved – Confidence interval on point biserial correlation coefficient

Tags: confidence-interval, correlation

How can I calculate a confidence interval for a point-biserial correlation coefficient? I'm calculating the point-biserial correlation between a continuous variable (score on a test) and a dichotomous variable (passing an interview). How can I calculate any sort of measure of the confidence of this value? For example, if I calculate the correlation between [1, 0] and [100.0, 20.0], I get a correlation of 1.0. Clearly, however, these results could be due to chance alone.

The following paper describes a method, but unfortunately I do not follow the notation. (What is $\sigma_r$?) I would be much obliged if someone could explain the point-biserial confidence calculation to me (or point me to any stats library with this functionality).

Thanks a lot!

Best Answer

The model used to get the confidence interval to which you referred has several parts. Following Tate (1954), suppose we have two variables $X$ and $Y$, where $Y$ is the continuous variable and $X$ is the dichotomous variable taking values 0 and 1.

$X$ is a Bernoulli random variable with probability $p$ that $X=1$.

$Y$ is normally distributed with mean $\mu_0$ when $X=0$ and mean $\mu_1$ when $X=1$, both with equal variance $\tau^2$.

Let the standardized difference between the two means be $$\Delta = \frac{\mu_1 - \mu_0} {\tau}.$$ Then the true point-biserial value in this case is given by $$\rho(X,Y) = \Delta \sqrt { \frac {p(1-p)}{1 + p(1-p)\Delta^2} }.$$

Asymptotically, the distribution of the sample point-biserial $r$ is normal with mean $\rho=\rho(X, Y)$ and variance $$ \frac{ 4 p(1-p) - \rho^2(6p(1-p) -1) } {4np(1-p)} (1-\rho^2)^2,$$ which is equivalent to the formula in the reference.

(The $\sigma_r$ just refers to the square root of that quantity. It is the standard deviation of the asymptotic distribution.)

This being an asymptotic distribution, the sample size needs to be "large enough". Tate (1954) does provide details on calculating the distribution with small sample sizes, but this requires more work.

To apply this formula, you need to know $p$. In some cases, you may have a good idea of what that is, but in others you may not.

For the sake of example, let's say that $p=0.4$, $\mu_0= 10$, $\mu_1=14$, and $\tau=2$. This gives a standardized difference $\Delta = (14 - 10)/2 = 2$.

Then, the true point-biserial is $$ \rho = 2 \sqrt { \frac {0.4(0.6)}{1 + 0.4(0.6)(2^2)} } \approx 0.6998542.$$
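As a quick numerical check of that value (base R arithmetic only):

```r
# True point-biserial for p = 0.4, mu_0 = 10, mu_1 = 14, tau = 2
p     <- 0.4
Delta <- (14 - 10) / 2
rho   <- Delta * sqrt(p * (1 - p) / (1 + p * (1 - p) * Delta^2))
rho
# [1] 0.6998542
```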

Here is some R code for generating a small sample with $n=20$, calculating the point-biserial, and showing the computation of the confidence interval. (The simulation draws $X$ with $p=0.6$ rather than $0.4$; since the formulas involve $p$ only through $p(1-p)$, the true $\rho$ is unchanged.)

set.seed(101)
X <- rbinom(20, 1, 0.6)                                # dichotomous variable, P(X = 1) = 0.6
Y <- rnorm(20, mean = ifelse(X==0, 10, 14), sd=2)      # mu_0 = 10, mu_1 = 14, tau = 2

cbind(X, Y)

#      X         Y
# [1,] 1 15.052896
# [2,] 1 12.410311
# [3,] 0 12.855511
# [4,] 0  7.066361
# [5,] 1 13.526633
# [6,] 1 13.613324
# [7,] 1 12.300491
# [8,] 1 14.116931
# [9,] 0  8.364659
#[10,] 1  9.899384
#[11,] 0  9.672489
#[12,] 0 11.417044
#[13,] 0  9.464039
#[14,] 0  7.072156
#[15,] 1 15.488872
#[16,] 1 11.179220
#[17,] 0 10.934135
#[18,] 1 13.761360
#[19,] 1 14.934478
#[20,] 1 14.996271

Now, calculate the sample point-biserial. Notice that when the data are coded as 0/1 we can just use the usual Pearson correlation:

r <- cor(X, Y)
r

#[1] 0.8017445
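As a sanity check on that equivalence, the point-biserial can also be computed directly from the two group means and the (divide-by-$n$) standard deviation of $Y$; the name `rpb` below is just illustrative:

```r
set.seed(101)
X <- rbinom(20, 1, 0.6)
Y <- rnorm(20, mean = ifelse(X == 0, 10, 14), sd = 2)

# Point-biserial from first principles:
# (M1 - M0) / s_Y * sqrt(p_hat * (1 - p_hat)),
# where s_Y uses the divide-by-n convention
p_hat <- mean(X)
s_y   <- sqrt(mean((Y - mean(Y))^2))
rpb   <- (mean(Y[X == 1]) - mean(Y[X == 0])) / s_y * sqrt(p_hat * (1 - p_hat))

all.equal(rpb, cor(X, Y))
# [1] TRUE
```

The divisor ($n$ versus $n-1$) cancels in the Pearson correlation, which is why the two computations agree exactly.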

Set up a function to calculate $\sigma_r$ for different values of $p$:

sigma_r <- function(r, n, p) {
  # Asymptotic standard deviation of the sample point-biserial (Tate, 1954)
  num <- 4 * p * (1-p) - r^2 * (6 * p * (1-p) - 1)
  den <- 4 * n * p * (1-p)
  sqrt( (num / den) * (1-r^2)^2 )
}

Finally, calculate some 95% confidence intervals with different values of $p$. (Note that the upper 2.5% point of the standard normal is about 1.96; that is, $z_{0.025} \approx 1.96$.)

c( r - 1.96 * sigma_r(r, 20, 0.5), r + 1.96 * sigma_r(r, 20, 0.5) )
#[1] 0.6727809 0.9307082

c( r - 1.96 * sigma_r(r, 20, 0.6), r + 1.96 * sigma_r(r, 20, 0.6) )
#[1] 0.6702605 0.9332285

c( r - 1.96 * sigma_r(r, 20, 0.7), r + 1.96 * sigma_r(r, 20, 0.7) )
#[1] 0.6616289 0.9418601

There are two main issues. The first is that we do not know $p$. The second is that the sample size has to be large enough. And, the sample should be a random sample. Okay, there are three main issues. And, the model needs to be appropriate. Okay, among the many issues...

It seems intuitive to substitute the usual estimate for $p$, namely the sample proportion of the $X$'s that are 1. But, this theoretical derivation does not cover that.
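For illustration only, and keeping that caveat in mind, here is what the plug-in version looks like with the simulated data (same `sigma_r` as above):

```r
set.seed(101)
X <- rbinom(20, 1, 0.6)
Y <- rnorm(20, mean = ifelse(X == 0, 10, 14), sd = 2)
r <- cor(X, Y)

# Asymptotic standard deviation of the sample point-biserial (Tate, 1954)
sigma_r <- function(r, n, p) {
  num <- 4 * p * (1 - p) - r^2 * (6 * p * (1 - p) - 1)
  den <- 4 * n * p * (1 - p)
  sqrt((num / den) * (1 - r^2)^2)
}

p_hat <- mean(X)                       # sample proportion of 1's
ci <- c(r - 1.96 * sigma_r(r, 20, p_hat),
        r + 1.96 * sigma_r(r, 20, p_hat))
ci
```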

A "large enough" sample size will probably depend on the size of $p$ and of $r$.
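One way to get a feel for this is a small coverage simulation under the model above, with the true $p$ plugged in (the 2,000 replications and the particular parameter values are arbitrary choices):

```r
set.seed(202)

# Asymptotic standard deviation of the sample point-biserial (Tate, 1954)
sigma_r <- function(r, n, p) {
  num <- 4 * p * (1 - p) - r^2 * (6 * p * (1 - p) - 1)
  den <- 4 * n * p * (1 - p)
  sqrt((num / den) * (1 - r^2)^2)
}

n   <- 20
p   <- 0.6
rho <- 2 * sqrt(p * (1 - p) / (1 + p * (1 - p) * 2^2))  # true rho, Delta = 2

# For each replication: simulate, compute r, check whether the
# nominal 95% interval around r captures the true rho
covered <- replicate(2000, {
  X <- rbinom(n, 1, p)
  Y <- rnorm(n, mean = ifelse(X == 0, 10, 14), sd = 2)
  r <- suppressWarnings(cor(X, Y))
  if (is.na(r)) NA else abs(r - rho) <= 1.96 * sigma_r(r, n, p)
})
mean(covered, na.rm = TRUE)   # empirical coverage of the nominal 95% interval
```

If the empirical coverage falls well short of 0.95, the asymptotic approximation is not yet trustworthy at that $n$, $p$, and $\rho$.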

It also seems reasonable to think about using the bootstrap to develop confidence intervals. Harris and Kolen (1988) seem to discuss this, but I do not have access to the article, though their abstract suggests use of the usual approximation.
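For completeness, here is a percentile-bootstrap sketch in base R (a generic recipe, not necessarily the procedure Harris and Kolen studied; B = 2000 resamples is an arbitrary choice):

```r
set.seed(101)
X <- rbinom(20, 1, 0.6)
Y <- rnorm(20, mean = ifelse(X == 0, 10, 14), sd = 2)

# Percentile bootstrap: resample (X, Y) pairs with replacement, recompute r
B <- 2000
boot_r <- replicate(B, {
  idx <- sample(length(X), replace = TRUE)
  suppressWarnings(cor(X[idx], Y[idx]))
})
boot_r <- boot_r[!is.na(boot_r)]   # drop degenerate resamples (all X equal)

bci <- quantile(boot_r, c(0.025, 0.975))
bci
```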

You could calculate all of these in any computer language or statistical package, probably. You could set it up in Excel.

For example, if I calculate the correlation between [1, 0] and [100.0, 20.0], I get a correlation of 1.0. Clearly, however, these results could be due to chance alone.

Well, with only two points the correlation can only be $-1$, $1$ (two points always lie on a line), or undefined (if either variable is constant). But your idea is right. If you are going to be working with only very small samples all the time, then it would be worth doing some digging for methodology that is tuned to that case.

Tate, R. F. (1954). Correlation between a discrete and continuous variable. Point-biserial correlation. Annals of Mathematical Statistics, 25(3), 603--607.

Harris, D. J. and Kolen, M. J. (1988). Bootstrap and traditional standard errors of the point-biserial. Educational and Psychological Measurement, 48(1), 43--51.