Pearson's correlation measures the strength of the linear relationship between two variables. When applying normalization or a transformation to the original dataset, does the method need to be linear in order not to affect the correlation results?
More specifically, does the correlation change after a log-transformation, for example? I think so, because the log-transformation changes the distribution of the data and the original linear relationship is rescaled by the log() function.
To extend the question: could anyone give a general summary of which normalizations/transformations can be used when you want to rescale the data but do not want to change the relationships among your variables as described by a correlation metric?
Thanks!
You are most welcome to refine and reshape my question if it's not accurate.
Best Answer
Pearson's correlation measures the linear component of association. So you are correct that linear transformations of the data (with positive slope) will not affect the correlation between them; a negative slope only flips its sign. However, nonlinear transformations will generally have an effect.
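The invariance under linear transformations can be checked in a few lines. This is a sketch in which the constants $a, b, c, d$ (with $b, d \neq 0$) are notation introduced here for illustration, not part of the question:

$$
\operatorname{Cov}(a + bX,\; c + dY) = bd\,\operatorname{Cov}(X, Y), \qquad
\operatorname{SD}(a + bX) = |b|\,\operatorname{SD}(X), \qquad
\operatorname{SD}(c + dY) = |d|\,\operatorname{SD}(Y),
$$

$$
\operatorname{Corr}(a + bX,\; c + dY)
= \frac{bd\,\operatorname{Cov}(X, Y)}{|b|\,|d|\,\operatorname{SD}(X)\,\operatorname{SD}(Y)}
= \operatorname{sgn}(bd)\,\operatorname{Corr}(X, Y).
$$

So shifting and rescaling (for example, standardizing to z-scores or min-max scaling) leaves Pearson's $r$ unchanged as long as $bd > 0$. No such cancellation happens for a nonlinear function such as $\log$.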
Here is a demonstration: generate right-skewed, correlated data vectors `x` and `y`. Pearson's correlation is $r = 0.987.$ (The correlation of $X$ and $Y^\prime = 3 + 5Y$ is the same.) However, if the second variable is log-transformed, Pearson's correlation changes to $r = 0.862.$
Here are the corresponding plots:
By contrast, Spearman's correlation is unaffected by the (monotone increasing) log-transformation. Spearman's correlation is based on ranks of observations and log-transformation does not change ranks. Before and after transformation, $r_S = 0.966.$
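For readers who want to reproduce this kind of comparison, here is a minimal Python sketch. It is not the code behind the numbers quoted above: the simulated data (exponential `x` plus exponential noise), the seed, and the helper names `pearson`, `ranks`, and `spearman` are all assumptions made for illustration, so the printed values will differ from $0.987$, $0.862$, and $0.966$.

```python
import numpy as np

def pearson(a, b):
    # Pearson's r taken from the 2x2 correlation matrix
    return np.corrcoef(a, b)[0, 1]

def ranks(a):
    # Ordinal ranks 0..n-1; adequate here because continuous
    # simulated data have no ties (an assumption of this sketch)
    return np.argsort(np.argsort(a))

def spearman(a, b):
    # Spearman's correlation is Pearson's r computed on the ranks
    return pearson(ranks(a), ranks(b))

# Hypothetical right-skewed, positively correlated data
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1000)
y = x + rng.exponential(scale=0.2, size=1000)  # positive, so log(y) is defined

r_xy = pearson(x, y)               # original data
r_linear = pearson(x, 3 + 5 * y)   # linear transformation of y
r_log = pearson(x, np.log(y))      # nonlinear (log) transformation of y

rs_xy = spearman(x, y)             # Spearman on original data
rs_log = spearman(x, np.log(y))    # Spearman after log transformation

print(f"Pearson:  original {r_xy:.3f}, linear {r_linear:.3f}, log {r_log:.3f}")
print(f"Spearman: original {rs_xy:.3f}, log {rs_log:.3f}")
```

Under these assumptions, the first two Pearson values should agree up to floating-point error, the log-transformed one should be visibly smaller, and the two Spearman values should coincide because the log preserves the ordering of the observations.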