Solved – Derivation of the standard error for Pearson’s correlation coefficient

correlation, pearson-r, standard error

I am wondering how to derive the formula for the standard error of Pearson's correlation coefficient which is given in Zar for example as

$$
\newcommand{\cov}{{\rm Cov}}
\newcommand{\var}{{\rm Var}}
\newcommand{\sd}{{\rm SD}}
SE_r =\sqrt{\frac{1-r^2}{n-2}}$$

I tried to get it by estimating the variance of $r$, where

$$r =\frac{\cov(x,y)}{\sd(x)\sd(y)}$$

and $\var(X) = E(X^2) - E(X)^2$, so we get $\var(r) = E\bigg(\frac{\cov(x,y)^2}{\var(x)\var(y)}\bigg) - r^2$. But from here I don't know how to continue, since $E\bigg(\frac{\cov(x,y)^2}{\var(x)\var(y)}\bigg)$ would have to be $\frac{1+(n-3)r^2}{n-2}$ to finally get to

$$\var(r) =\frac{1-r^2}{n-2}$$
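Spelling out the algebra behind that requirement, with $r^2$ standing in for the squared mean as above (this only rearranges the target formula and is not a derivation of it):

$$
E(r^2) = \operatorname{Var}(r) + r^2 = \frac{1-r^2}{n-2} + r^2 = \frac{1 - r^2 + (n-2)\,r^2}{n-2} = \frac{1+(n-3)\,r^2}{n-2}
$$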

Any suggestions or references where I could look this up?

Best Answer

After looking for a long time for an answer to this same question, I found a couple of interesting links. The first is https://www.jstor.org/stable/2277400?seq=1#page_scan_tab_contents

Only the first page is visible there, but that is where the derivation is. The "standard deviation by Dr. Sheppard" comes from what is called the asymptotic distribution of moments, a little of which you can see here:

https://books.google.com/books?id=Uc9C90KKW_UC&pg=PA126&lpg=PA126&dq=Mst+pearson+Sheppard&source=bl&ots=Kvw0xTLzps&sig=pyHVB_ybjsnb_0QOBDHST6SRi-M&hl=en&sa=X&ved=0ahUKEwimjvjQ8NnSAhWEppQKHRqbC1sQ6AEIIjAD#v=onepage&q=Mst%20pearson%20Sheppard&f=false

The reason for $n-2$ instead of $n$ under the root is that your formula assumes a t-distribution with $n-2$ degrees of freedom, while the one in the links assumes a normal distribution.
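One way to see the t-distribution connection, sketched here under the usual simple-regression setup rather than taken from the linked sources: the statistic for testing zero correlation is $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom, so $SE_r = r/t = \sqrt{(1-r^2)/(n-2)}$. The R snippet below checks this numerically against `cor.test` and `lm`. The simulated data, the seed, and all variable names are illustrative assumptions.

```r
# Illustrative data (assumed; any paired numeric sample would do)
set.seed(1)
n <- 30
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

r    <- cor(x, y)                      # Pearson correlation
se_r <- sqrt((1 - r^2) / (n - 2))      # SE formula from the question
t_manual <- r / se_r                   # t statistic implied by that SE

# t statistic and degrees of freedom reported by the built-in test
ct <- cor.test(x, y)
t_cortest <- unname(ct$statistic)

# t statistic of the slope in the regression of y on x
t_lm <- coef(summary(lm(y ~ x)))["x", "t value"]

# All three agree up to floating-point error, and df = n - 2
stopifnot(isTRUE(all.equal(t_manual, t_cortest)))
stopifnot(isTRUE(all.equal(t_manual, t_lm)))
print(c(manual = t_manual, cor.test = t_cortest, lm = t_lm))
print(ct$parameter)
```

The agreement with the slope's t value reflects that testing $\rho = 0$ and testing a zero regression slope are the same test in the two-variable case.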

Code

The following R code generates data, computes $b$, $s^2$, and the diagonal of $V$ (the coefficient estimates, the residual variance estimate, and the coefficient variances) from only the means and covariances of the data (along with the values of $n$ and $p$, of course), and compares them to standard least-squares output derived from the data. In all examples I have run (including multiple regression with $p\gt 1$), agreement is exact to the default output precision (about seven decimal places).

For simplicity, to avoid doing essentially the same set of operations three times, this code first combines all the summary data into a single matrix `v` and then extracts $X^\prime X$, $X^\prime Y$, and $Y^\prime Y$ from its entries. The comments note what is happening at each step.

```r
n <- 24
p <- 3
beta <- seq(-p, p, length.out=p) # The model
set.seed(17)
x <- matrix(rnorm(n*p), ncol=p)  # Independent variables
y <- x %*% beta + rnorm(n)       # Dependent variable plus error
#
# Compute the first and second order data summaries.
#
m <- rep(0, p+1)                 # Default means
m <- colMeans(cbind(x,y))        # If means are available--comment out otherwise
v <- cov(cbind(x,y))             # All variances and covariances
#
# From this point on, only the summaries `m` and `v` are used for the calculations
# (along with `n` and `p`, of course).
#
m <- m * n                       # Compute column sums
v <- v * (n-1)                   # Recover sums of squares of residuals
v <- v + outer(m, m)/n           # Adjust to obtain the sums of squares
v <- rbind(c(n, m), cbind(m, v)) # Border with the sums and the data count
xx <- v[-(p+2), -(p+2)]          # Extract X'X
xy <- v[-(p+2), p+2]             # Extract X'Y
yy <- v[p+2, p+2]                # Extract Y'Y
b <- solve(xx, xy)               # Compute the coefficient estimates
s2 <- (yy - b %*% xy) / (n-p-1)  # Compute the residual variance estimate
#
# Compare to `lm`.
#
fit <- summary(lm(y ~ x))
(rbind(Correct=coef(fit)[, "Estimate"], From.summary=b))    # Coeff. estimates
(c(Correct=fit$sigma, From.summary=sqrt(s2)))               # Residual SE
#
# The SE of the intercept will be incorrect unless true means are provided.
#
se <- sqrt(diag(solve(xx) * c(s2))) # Remove `diag` to compute the full var-covar matrix
(rbind(Correct=coef(fit)[, "Std. Error"], From.summary=se)) # Coeff. SEs
```
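As a small companion sketch for the single-predictor case ($p = 1$), the slope's standard error can be written using only the two standard deviations, the correlation, and $n$: $SE_b = \frac{s_y}{s_x}\sqrt{\frac{1-r^2}{n-2}}$, which is the correlation standard error rescaled by $s_y/s_x$. The data, seed, and variable names below are assumptions made for illustration; they are not part of the code above.

```r
# Illustrative single-predictor data (assumed)
set.seed(17)
n <- 24
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Summary statistics only: standard deviations, correlation, sample size
r  <- cor(x, y)
sx <- sd(x)
sy <- sd(y)

# Slope SE from summaries: SE_r rescaled by sy / sx
se_from_summary <- (sy / sx) * sqrt((1 - r^2) / (n - 2))

# Slope SE reported by lm on the raw data
se_from_lm <- coef(summary(lm(y ~ x)))["x", "Std. Error"]

# The two agree up to floating-point error
stopifnot(isTRUE(all.equal(se_from_summary, se_from_lm)))
print(c(From.summary = se_from_summary, Correct = se_from_lm))
```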
