Regression – Understanding the Effect of Log Transformation on Pearson Correlation and Data Insight

regression

I have a data set where an input variable x has a 0.59 Pearson correlation with a variable y (sample size is about 300). After performing simple linear regression I get the following standardized-residuals scatter plot and QQ plot:

Residuals scatter plot

Residuals QQ plot

Regression result and scatter plot relating x to y

The residuals scatter plot seems fairly random with no discernible pattern to me, but the QQ plot shows a slight curvature and the Shapiro-Wilk test rejects the null hypothesis, so I assumed there might be some nonlinear relationship between x and y. I then tried applying a few transformations such as sqrt(x), log(x), and x^2 to the input vector, and to my surprise the Pearson correlation barely changes.

With x^2 it goes down to 0.58; for log(x) and sqrt(x) the correlation increases by about 0.002 and 0.003 relative to plain x. Regression on the transformed variable also yields results very similar to the untransformed variable, and the QQ plot still shows the same slight curvature.

At first I thought there was a mistake in my code, but I checked it several times and I'm fairly confident that's not the case, so now I'm wondering what conclusions I can take away from this. Maybe y is related to x through a combination of linear and nonlinear terms? Maybe some other unknown variable that is collinear with x but has a nonlinear relation to y is the cause of the curvature in the QQ plot? Maybe there's some other variable transformation I haven't tried that could explain it? What other methods could I use to further explore the relationship between the two variables?

Best Answer

Don't be surprised that Pearson correlation coefficients aren't greatly affected by log, square-root, square, or similar simple transformations of $X$, all of which are monotonic in your case (your x values must be positive for log(x) to work at all). Monotonic transformations don't change rank orders, so non-parametric (rank-based) correlations aren't affected at all. In terms of the Pearson correlation:

$$\rho_{X,Y}=\frac{\operatorname{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y} $$

when $X$ is transformed and $Y$ isn't, what matters is how much the transformation moves values of $X$ above and below the mean on the transformed scale, relative to the corresponding change in $\sigma_X$. The net effect can be very small. The sample Pearson correlation is also biased and not robust, further complicating attempts at intuition in practice.

Consider the following bivariate normal data in R; you need to have the mvtnorm package available:

set.seed(303)
## covariance matrix implies a population correlation of -0.059/0.1 = -0.59
norm2 <- mvtnorm::rmvnorm(300, mean = c(1, 1),
                          sigma = matrix(c(.1, -.059, -.059, .1), byrow = TRUE, nrow = 2))
norm2 <- data.frame(norm2)
names(norm2) <- c("x", "y")
norm2$y <- 35 * norm2$y ## get roughly into reported ranges of values
plot(y ~ x, data = norm2) ## not shown; similar to the cloud in the question's plot, no outliers
with(norm2, cor(y, x))
# [1] -0.5559027
with(norm2, cor(y, x^2))
# [1] -0.5400946
with(norm2, cor(y, log(x)))
# [1] -0.5426008
with(norm2, cor(y, sqrt(x)))
# [1] -0.5546897
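
To see the rank-order point directly: Spearman correlations computed on the same simulated data should agree exactly across these transformations, because they depend only on the ranks of x, which the transformations preserve (a minimal sketch continuing the norm2 example above):

with(norm2, cor(y, x, method = "spearman"))
with(norm2, cor(y, log(x), method = "spearman"))  ## identical: log preserves ranks of positive x
with(norm2, cor(y, sqrt(x), method = "spearman")) ## identical: sqrt preserves ranks
with(norm2, cor(y, x^2, method = "spearman"))     ## identical here, since all x > 0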

If you don't have a solid theoretical reason for a particular transformation, it's often good to let the data suggest the functional form of the relationship by modeling the continuous predictor variable with a regression spline.
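
For example, a minimal sketch with the base-R splines package, continuing the norm2 data above (the choice of df = 4 is illustrative, not a recommendation):

library(splines)
## Fit y as a natural cubic spline in x; df controls the flexibility
fit_spline <- lm(y ~ ns(x, df = 4), data = norm2)
fit_linear <- lm(y ~ x, data = norm2)
## The linear fit is nested in the spline fit, so an F-test asks
## whether the nonlinear terms improve on the straight line
anova(fit_linear, fit_spline)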

Also, see the discussion on: Is normality testing 'essentially useless'? I would worry more about the potential high-leverage point noted by @dipetkov, but you need to apply your knowledge of the subject matter.
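
A quick way to screen for high-leverage points is via hat values and the built-in lm diagnostic plots (a minimal sketch, assuming your model is a simple lm fit of y on x; shown here on norm2):

fit <- lm(y ~ x, data = norm2)
h <- hatvalues(fit)
## a common rule of thumb flags points with leverage above 2p/n
which(h > 2 * mean(h))
## residuals vs. leverage, with Cook's distance contours
plot(fit, which = 5)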
