Solved – Calculating correlation coefficient

correlation, covariance, r

a <- c(10,11)
b <- c(2,3)

The covariance (C) between a and b =

sum ( (a-mean(a)) * (b-mean(b)) ) # equal to  cov(a,b) in this case as n=2

The sample covariance (Csamp) between a and b =

sum ( (a-mean(a)) * (b-mean(b)) ) / (n-1) # equal to cov(a,b), with n = length(a)

Now let's say the sum of squared differences for a is SSa. This means the variance (V) of a is SSa/n and the sample variance (Vsamp) is SSa/(n-1). The sample standard deviation (SDsamp) of a would be sqrt(SSa/(n-1)).

The correlation coefficient equations, as I have found them, are:

(1) C / (sqrt(SSa) * sqrt(SSb))

# or

(2) C / (SDsamp_a * SDsamp_b)

# (1) uses sqrt(SSa), not V or Vsamp, and (2) uses SDsamp, not V, and does not use Csamp

I am confused about how to calculate the correlation coefficient (CF) and the sample correlation coefficient (CFsamp). The two formulas above do not seem the same to me. Can someone explain how to calculate CF and CFsamp?

Best Answer

The Pearson product-moment correlation coefficient is defined as:

$$r=\frac{cov(X,Y)}{\sigma_X\sigma_Y}.$$

To estimate it from a sample, you plug in the sample estimates of the covariance and the standard deviations:

$$r=\frac{\frac{1}{n-1}\sum{(X_i-\bar{X})(Y_i-\bar{Y})}}{\sqrt{\frac{1}{n-1}\sum(X_i-\bar{X})^2}\sqrt{\frac{1}{n-1}\sum(Y_i-\bar{Y})^2}}$$

Fortunately, the factors of $\frac{1}{n-1}$ in the numerator and denominator cancel, and you get

$$r=\frac{\sum{(X_i-\bar{X})(Y_i-\bar{Y})}}{\sqrt{\sum(X_i-\bar{X})^2}\sqrt{\sum(Y_i-\bar{Y})^2}}$$

Because of this cancellation, it doesn't matter for the correlation coefficient whether you divide by $n$ or by $n-1$, as long as you use the same divisor for the covariance and for both standard deviations.
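The cancellation can be checked numerically. The following is a minimal sketch in R, assuming two made-up vectors `x` and `y` (hypothetical data, not taken from the question) and illustrative variable names. It computes the coefficient with divisor $n-1$, with divisor $n$, and with no divisor at all, and compares each against the built-in `cor()`. It also shows what happens when the divisors are mixed, as in formula (2) of the question if C is the raw sum of cross-products: the result is off by a factor of $n-1$.

```r
# Hypothetical data, chosen only for illustration.
x <- c(10, 11, 13, 14, 18)
y <- c(2, 3, 3, 5, 6)
n <- length(x)

# Sums of squares and of cross-products about the means.
ss_x  <- sum((x - mean(x))^2)
ss_y  <- sum((y - mean(y))^2)
sp_xy <- sum((x - mean(x)) * (y - mean(y)))

# Divide by n - 1 everywhere (sample covariance and sample SDs).
r_nm1 <- (sp_xy / (n - 1)) / (sqrt(ss_x / (n - 1)) * sqrt(ss_y / (n - 1)))

# Divide by n everywhere.
r_n <- (sp_xy / n) / (sqrt(ss_x / n) * sqrt(ss_y / n))

# No divisor at all: the fully cancelled form.
r_raw <- sp_xy / (sqrt(ss_x) * sqrt(ss_y))

# Built-in Pearson correlation for comparison.
r_builtin <- cor(x, y)

# All three agree with cor() up to floating-point tolerance.
print(all.equal(r_nm1, r_builtin))
print(all.equal(r_n,   r_builtin))
print(all.equal(r_raw, r_builtin))

# Mixed divisors: raw sum of cross-products over sample SDs.
# sd() divides by n - 1, so this is (n - 1) times the correlation.
r_mixed <- sp_xy / (sd(x) * sd(y))
print(all.equal(r_mixed, (n - 1) * r_builtin))
```

With the question's data, `a <- c(10,11)` and `b <- c(2,3)`, we have $n-1=1$, so the mixed version happens to coincide with the correct one. The discrepancy only shows up for $n>2$.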