a <- c(10,11)
b <- c(2,3)
The covariance (C) between a and b =
sum( (a-mean(a)) * (b-mean(b)) ) # equal to cov(a,b) in this case, but only because n=2, so n-1=1
The sample covariance (Csamp) between a and b =
sum( (a-mean(a)) * (b-mean(b)) ) / (n-1) # equal to cov(a,b)
Now let's say the sum of squared differences for a = SSa. This means the variance (V) for a = SSa/n and the sample variance (Vsamp) is SSa/(n-1). The sample standard deviation (SDsamp) for a would be sqrt(SSa/(n-1)).
The correlation coefficient equations, as I have found them, are:
(1) C / (sqrt(SSa) * sqrt(SSb))
# or
(2) C / (SDsamp_a * SDsamp_b)
# (1) uses sqrt(SSa), not V or Vsamp, and (2) uses SDsamp, not V, and does not use Csamp
I am confused about how to calculate the correlation coefficient (CF) and the sample correlation coefficient (CFsamp). The two formulas above do not seem the same to me. Can someone explain how to calculate CF and CFsamp?
Best Answer
The Pearson product-moment correlation coefficient is defined as:
$$r=\frac{cov(X,Y)}{\sigma_X\sigma_Y}.$$
To estimate it from a sample, you plug in the sample estimates of the covariance and the standard deviations:
$$r=\frac{\frac{1}{n-1}\sum{(X_i-\bar{X})(Y_i-\bar{Y})}}{\sqrt{\frac{1}{n-1}\sum(X_i-\bar{X})^2}\sqrt{\frac{1}{n-1}\sum(Y_i-\bar{Y})^2}}$$
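Fortunately, the factor $\frac{1}{n-1}$ cancels: the two square roots in the denominator multiply to give $\frac{1}{n-1}$, matching the one in the numerator, and you get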
$$r=\frac{\sum{(X_i-\bar{X})(Y_i-\bar{Y})}}{\sqrt{\sum(X_i-\bar{X})^2}\sqrt{\sum(Y_i-\bar{Y})^2}}$$
Because of this cancellation, it doesn't matter for the correlation coefficient whether you divide by $n$ or by $n-1$ when calculating the covariance and standard deviations, as long as you use the same divisor for all of them.
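As a quick numerical check of the cancellation, here is a minimal R sketch. The vectors `x` and `y` are made-up illustration data, not the `a` and `b` from the question (with only two points the correlation is always exactly +1 or -1, so the comparison would not show much), and the variable names `sp`, `ssx`, `ssy` are my own.

```r
# Hypothetical data for illustration only (not from the question)
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 2, 6, 8)
n <- length(x)

# Sum of cross-products and sums of squares around the means
sp  <- sum((x - mean(x)) * (y - mean(y)))
ssx <- sum((x - mean(x))^2)
ssy <- sum((y - mean(y))^2)

# Everything divided by n - 1 (sample covariance and sample SDs)
r_sample <- (sp / (n - 1)) / (sqrt(ssx / (n - 1)) * sqrt(ssy / (n - 1)))

# Everything divided by n (population-style covariance and SDs)
r_pop <- (sp / n) / (sqrt(ssx / n) * sqrt(ssy / n))

# No divisor at all: the simplified formula after cancellation
r_raw <- sp / (sqrt(ssx) * sqrt(ssy))

# All three should agree with each other and with the built-in cor(),
# up to floating-point error
all.equal(r_sample, r_pop)
all.equal(r_raw, cor(x, y))
```

In the question's notation, this suggests that formula (1) works when C is the raw sum of cross-products, while formula (2) needs the sample covariance Csamp in the numerator to match the sample standard deviations in the denominator. Mixing divisors (for example the raw sum over SDsamp_a * SDsamp_b) would leave the result off by a factor of n-1.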