Pearson’s Correlation – Does Data Normalization and Transformation Change the Pearson’s Correlation?

correlation, data transformation, normalization, pearson-r

Since Pearson's correlation measures the linear association between two variables, I am wondering: when normalizing or transforming the original dataset, does the method need to be linear in order not to affect the correlation results?

More specifically, does the correlation change after a log-transformation, for example? I think so, because the log-transformation changes the distribution of the data, and the original linear relationship is rescaled by the log() function.

To extend the question, could anyone provide a general summary of which normalizations/transformations can be used when you want to scale the data but do not want to change the original relationship among your variables, as described by the correlation metric?

Thanks!

You are most welcome to refine and reshape my question if it's not accurate.

Best Answer

Pearson's correlation measures the linear component of association. So you are correct: a linear transformation (with positive slope) of either variable leaves the correlation between them unchanged. Nonlinear transformations, however, will generally change it.
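A short derivation of why this holds, using constants $a, b, c, d$ (introduced here only for illustration, with $b, d \neq 0$). Since $\operatorname{Cov}(a + bX,\, c + dY) = bd\,\operatorname{Cov}(X, Y)$ and $\operatorname{SD}(a + bX) = |b|\,\operatorname{SD}(X)$,

$$
r(a + bX,\; c + dY) = \frac{bd\,\operatorname{Cov}(X, Y)}{|b|\,\operatorname{SD}(X)\,|d|\,\operatorname{SD}(Y)} = \operatorname{sign}(bd)\; r(X, Y).
$$

So shifts and positive rescalings leave $r$ untouched, and a negative slope on exactly one variable only flips its sign.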

Here is a demonstration: Generate right-skewed, correlated data vectors x and y. Pearson's correlation is $r = 0.987.$ (The correlation of $X$ and $Y^\prime = 3 + 5Y$ is the same.)

set.seed(2019)
x = rexp(100, .1);  y = x + rexp(100, .5)
cor(x, y)
[1] 0.987216
cor(x, 3 + 5*y)
[1] 0.987216     # no change with linear transformation of 'y'
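The same applies to the usual rescaling methods, because z-score standardization and min-max scaling are both linear transformations. A minimal sketch that regenerates the same x and y; the names `z` and `mm` are introduced here only for illustration:

```r
set.seed(2019)
x = rexp(100, .1);  y = x + rexp(100, .5)

z  = (y - mean(y)) / sd(y)               # z-score standardization
mm = (y - min(y)) / (max(y) - min(y))    # min-max scaling to [0, 1]

all.equal(cor(x, z),    cor(x, y))    # TRUE: unchanged
all.equal(cor(x, mm),   cor(x, y))    # TRUE: unchanged
all.equal(cor(x, -2*y), -cor(x, y))   # TRUE: a negative slope only flips the sign
```

Comparisons use `all.equal` rather than `==` because the results can differ in the last floating-point digits.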

However, if the second variable is log-transformed, Pearson's correlation changes to $r = 0.862.$

cor(x, log(y))
[1] 0.8624539

Here are the corresponding plots:

[Figure: plots of the data before and after the log transformation]

By contrast, Spearman's correlation is unaffected by the (monotone increasing) log-transformation. Spearman's correlation is based on ranks of observations and log-transformation does not change ranks. Before and after transformation, $r_S = 0.966.$

cor(x, y, method="spearman")
[1] 0.9655446
cor(rank(x), rank(log(y)))
[1] 0.9655446   # Spearman again

cor(x, log(y), method="spearman")
[1] 0.9655446
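This invariance is not specific to the log: any strictly increasing transformation preserves ranks. A brief sketch, assuming positive, tie-free data as generated above; `sqrt`, the cube and the reciprocal are chosen here only as examples, and `rs` is a name introduced for illustration:

```r
set.seed(2019)
x = rexp(100, .1);  y = x + rexp(100, .5)

rs = cor(x, y, method = "spearman")

all.equal(cor(x, sqrt(y), method = "spearman"), rs)    # TRUE: increasing transform of y
all.equal(cor(x^3, y,     method = "spearman"), rs)    # TRUE: increasing transform of x
all.equal(cor(x, 1/y,     method = "spearman"), -rs)   # TRUE: decreasing transform flips the sign
```

So if the goal is a measure that does not depend on the choice of monotone rescaling, Spearman's correlation is the one that behaves that way; Pearson's is only invariant to linear rescaling.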