Why normalize the vectors to calculate the Pearson correlation coefficient

correlation, pearson-r, r-squared, regression

I learned from this answer that the correlation $R$ is $\cos(\theta)$, where $\theta$ is the angle between a dependent vector $Y$ and an independent vector $X$. However, I learned from this article that the two vectors must first be normalized (centered by subtracting their respective means).

I believe the normalized version is correct for two reasons: 1) the unnormalized version is not invariant to both scaling and shifting; 2) the result of the normalized version is exactly the correlation formula:

$$R=\frac{1}{n-1}\sum_{i=1}^n\frac{x_i-\bar x}{s_x}\frac{y_i-\bar y}{s_y}$$
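As a quick numerical check of point 2, the following R sketch compares the cosine of the angle between the mean-centered vectors with the formula above and with R's built-in `cor`. The vectors `x` and `y` are made-up illustrative data, not taken from the question.

```r
# Illustrative data (assumed; not from the original question)
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)
y <- c(1.0, 2.7, 1.5, 4.9, 3.1)
n <- length(x)

# Center each vector by subtracting its mean
xc <- x - mean(x)
yc <- y - mean(y)

# Cosine of the angle between the centered vectors
cos_centered <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))

# The formula above, using the sample standard deviations s_x and s_y
r_formula <- sum((xc / sd(x)) * (yc / sd(y))) / (n - 1)

# Built-in Pearson correlation
r_builtin <- cor(x, y)

# Both comparisons should agree up to floating-point error
all.equal(cos_centered, r_formula)
all.equal(cos_centered, r_builtin)
```

The factor $1/(n-1)$ cancels against the $n-1$ inside $s_x s_y$, which is why the formula reduces to a plain cosine of the centered vectors.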

But why should the two vectors be normalized?

Best Answer

The practical difference between the centered (normalised) and uncentered versions is that, for the Pearson product-moment correlation coefficient, it is possible to construct a hypothesis test with the following null and alternative hypotheses:

$$H_0: \rho = 0$$
$$H_a: \rho \neq 0$$
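One common way to carry out such a test is the $t$ test on $r$ with $n-2$ degrees of freedom, which assumes the data are roughly bivariate normal. The sketch below uses simulated data (the seed, sample size, and slope are arbitrary assumptions) and compares a hand-computed statistic with R's built-in `cor.test`.

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
n <- 50                     # assumed sample size
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)     # assumed linear relationship plus noise

r <- cor(x, y)

# t statistic under H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value

# Built-in test (Pearson and two-sided by default)
res <- cor.test(x, y)

# The hand-computed values should match the built-in ones
all.equal(unname(res$statistic), t_stat)
all.equal(res$p.value, p_manual)
```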

Note that the uncentered version does exist and is called Tucker's congruence coefficient (although it was first published by Cyril Burt in 1948). Moreover, the geometrical meaning of the Pearson and Tucker coefficients is the same: each is the cosine of the angle between two vectors, taken after mean-centering for Pearson and without centering for Tucker.
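To illustrate how the two coefficients behave, here is a minimal R sketch. The helper `congruence` is a hypothetical name introduced for this example, and the data are made up. It shows that shifting one vector leaves the Pearson coefficient unchanged but changes the uncentered coefficient.

```r
# Hypothetical helper: cosine of the angle between two vectors, no centering
congruence <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Illustrative data (assumed)
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)
y <- c(1.0, 2.7, 1.5, 4.9, 3.1)

# Shift x by a constant
x_shifted <- x - 10

# Pearson is unaffected by the shift
c(pearson = cor(x, y), pearson_shifted = cor(x_shifted, y))

# The uncentered coefficient changes, here even in sign
c(congruence = congruence(x, y), congruence_shifted = congruence(x_shifted, y))
```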

References

Burt, C. (1948). The factorial study of temperamental traits. British Journal of Mathematical and Statistical Psychology, 1(3), 178–203.

Tucker, L. R. (1951). A method for synthesis of factor analysis studies (No. PRS-984). Princeton: Educational Testing Service.

Related Solutions

Solved – Issues on computing Pearson correlation coefficient for two vectors

Hi, this should not be a problem, since the mean is explicitly subtracted. Here's a small example (all code in R):

library(mnormt)
# Draw 100 samples from a bivariate normal with unit variances and covariance 0.5
df <- rmnorm(n = 100, mean = rep(0, 2), varcov = matrix(c(1, 0.5, 0.5, 1), nrow = 2))

#We compute the correlation
cor(df)
        [,1]      [,2]
 [1,] 1.0000000 0.5605498
 [2,] 0.5605498 1.0000000

#We scale the first variable by 10000
df[,1] <- df[,1]*10000

#The correlation stays the same
cor(df)
         [,1]      [,2]
 [1,] 1.0000000 0.5605498
 [2,] 0.5605498 1.0000000

Hope this helps.

Edit: Following up on the comments (thanks to whuber): I understood the question as being about the magnitude of the whole vector. I gather from the discussion that some readers understood it as being about outliers. In that case my solution is, of course, not helpful.

Solved – What’s the formula of normalized correlation

I haven't come across this usage, but it seems easy to decode.

Matters may differ in your field, but within mainstream statistics, and in all statistics-using disciplines I know about, correlation is understood to be scaled, by definition, to fall within [-1, 1]. When calculated as in your formula, correlation is a cosine.

So the term "normalized" is just emphasizing that fact; it is not flagging a special case.

The unnormalized correlation would just be called the covariance.
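The relationship between the two can be checked numerically. This is a small R sketch on made-up data: the covariance changes with the units of the variables, while dividing it by both standard deviations recovers the correlation.

```r
# Illustrative data (assumed)
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)
y <- c(1.0, 2.7, 1.5, 4.9, 3.1)

# Covariance depends on the units of x and y
cov(x, y)
cov(1000 * x, y)    # 1000 times larger

# Dividing by both standard deviations removes the units
r_from_cov <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_from_cov, cor(x, y))

# The correlation itself is unchanged by rescaling x
all.equal(cor(1000 * x, y), cor(x, y))
```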

So you can't find this term in use because it is very unusual.
