Correlation – Proof of Point-Biserial Correlation as a Special Case of Pearson Correlation

correlation, pearson-r

I have been examining the use of the point-biserial correlation as a statistic to measure the relationship between a dichotomous variable and a continuous one. Wikipedia et al. seem to concur that the point-biserial correlation is a special case of the Pearson correlation, but I cannot find a proof of this, algebraic or otherwise, and it is making me wary of using it in my research (I need to do some statistical confidence testing afterwards). I have tried deriving the result myself, but keep going round in circles.

Any advice greatly appreciated.

Best Answer

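Let the $n$ data consist of $n_0\gt 0$ $(x, 0)$ pairs and $n_1\gt 0$ $(x, 1)$ pairs, so that $n = n_0 + n_1$, and let $M_0$ and $M_1$ be the means of the $x$ values in the two groups. Their Pearson correlation coefficient will be the same as that of the reversed data consisting of the corresponding $(0,x)$ and $(1,x)$ pairs, because the correlation is symmetric in its two arguments. Because there are exactly two distinct values of the first coordinates, the least-squares regression line of the reversed data must pass through the mean points $(0,M_0)$ and $(1,M_1)$, whence it has slope $(M_1-M_0)/(1-0) = M_1-M_0$. The correlation coefficient is obtained by standardizing this slope: since the slope equals $r$ times the ratio of the standard deviation of the second coordinates to that of the first, the slope must be multiplied by the standard deviation of the first coordinates and divided by the standard deviation of the second coordinates (the original $x$ values), written $s_n$. Both standard deviations use the divisor $n$ (any common divisor would do, because it cancels in the ratio). The standard deviation of the first coordinates is readily computed from the fact that they consist of $n_0$ zeros and $n_1$ ones; it equals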

$$\sqrt{\frac{n_1}{n}\left(1-\frac{n_1}{n}\right)} = \sqrt{\frac{n_0n_1}{n^2}}.$$

Consequently the Pearson correlation coefficient is

$$r = \frac{M_1-M_0}{s_n}\sqrt{\frac{n_0n_1}{n^2}},$$

which is precisely the Wikipedia formula for the point-biserial coefficient.
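As a quick numerical sanity check of the identity above, here is a minimal sketch in Python using NumPy. The function name `point_biserial` and the simulated data are illustrative assumptions, not part of the original answer; `s_n` is computed with the divisor $n$ (`ddof=0`), matching the formula.

```python
import numpy as np

def point_biserial(x, g):
    """Point-biserial coefficient from the closed-form formula.

    x : continuous values; g : group labels coded 0 or 1.
    (Hypothetical helper written for this illustration.)
    """
    x = np.asarray(x, dtype=float)
    g = np.asarray(g)
    n = x.size
    n1 = int(np.sum(g == 1))
    n0 = n - n1
    m0 = x[g == 0].mean()          # M_0
    m1 = x[g == 1].mean()          # M_1
    s_n = x.std(ddof=0)            # standard deviation with divisor n
    return (m1 - m0) / s_n * np.sqrt(n0 * n1 / n**2)

# Simulated data (an assumption for the demo): 7 zeros and 5 ones.
rng = np.random.default_rng(0)
g = np.array([0] * 7 + [1] * 5)
x = rng.normal(size=g.size) + 1.5 * g

r_formula = point_biserial(x, g)
r_pearson = np.corrcoef(x, g)[0, 1]   # ordinary Pearson correlation

print(r_formula, r_pearson)           # the two values agree up to rounding
```

The agreement does not depend on the particular data; any sample with at least one observation in each group and non-constant $x$ should give matching values.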

Figure: x vs. y with linear regression line and mean points shown

The heights of the red dots depict the mean values $M_0$ and $M_1$ of each vertical strip of points. The dashed gray line is the regression line.
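The claim that the regression line passes through both mean points can also be checked numerically. The sketch below uses a small made-up data set (an assumption for illustration, not the data in the figure) and fits the least-squares line with `numpy.polyfit`.

```python
import numpy as np

# Made-up example data: group labels (0/1) and continuous values.
g = np.array([0, 0, 0, 1, 1, 1, 1])
x = np.array([2.0, 3.5, 1.0, 4.0, 6.5, 5.0, 7.5])

# Least-squares line for x regressed on g: x ~ intercept + slope * g.
slope, intercept = np.polyfit(g, x, 1)

m0 = x[g == 0].mean()   # M_0, mean of the strip at g = 0
m1 = x[g == 1].mean()   # M_1, mean of the strip at g = 1

# The fitted line should hit (0, M_0) and (1, M_1).
print(np.isclose(intercept, m0), np.isclose(intercept + slope, m1))

# Standardizing the slope recovers the Pearson correlation.
r_from_slope = slope * g.std(ddof=0) / x.std(ddof=0)
r_pearson = np.corrcoef(g, x)[0, 1]
print(np.isclose(r_from_slope, r_pearson))
```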