Regression – Understanding the Effect of Log Transformation on Pearson Correlation and Data Insight

regression

I have a data set where an input variable x has a 0.59 Pearson correlation with a variable y (sample size is about 300). After performing simple linear regression I get the following standardized-residuals scatter plot and QQ plot:

Residuals scatter plot

Residuals QQ plot

Regression result and scatter plot relating x to y

The residuals scatter plot seems fairly random with no discernible pattern to me, but the QQ plot shows a slight curvature and the Shapiro-Wilk test rejects the null hypothesis, so I assumed there might be some nonlinear relationship between x and y. I then tried applying a few transformations such as sqrt(x), log(x), and x^2 to the input vector, and to my surprise the Pearson correlation barely changes.

With x^2 it goes down to 0.58; for log(x) and sqrt(x) the correlation increases by about 0.002 and 0.003 relative to plain x. Regression on the transformed variable also yields results very similar to the untransformed variable, and the QQ plot still shows the same slight curvature.

At first I thought there was a mistake in my code, but I checked it several times and I'm fairly confident that's not the case, so now I'm wondering what conclusions I can take away from this. Maybe y is related to x through a combination of linear and nonlinear terms? Maybe some other unknown variable that is collinear with x but has a nonlinear relation to y is the cause of the curvature in the QQ plot? Maybe there's some other variable transformation I haven't tried that could explain it? What other methods could I use to further explore the relationship between the two variables?

Best Answer

Don't be surprised that Pearson correlation coefficients aren't greatly affected by log, square-root, square, or similar simple transformations of $X$, all of which are monotonic in your case (your x values must be positive for log(x) to work at all). Monotonic transformations don't change rank orders, so non-parametric (rank-based) correlations aren't affected at all. In terms of the Pearson correlation:

$$\rho_{X,Y}=\frac{\operatorname{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y} $$

when $X$ is transformed and $Y$ isn't, what matters is how much the transformation moves values of $X$ above and below the mean on the transformed scale, relative to the corresponding change in $\sigma_X$. The net effect can be very small. The sample Pearson correlation is also biased and not robust, further complicating attempts at intuition in practice.

Consider the following bivariate normal data in R; you need to have the mvtnorm package available:

set.seed(303)
## covariance matrix implies a population correlation of -0.059/0.1 = -0.59
norm2 <- mvtnorm::rmvnorm(300, mean = c(1, 1),
                          sigma = matrix(c(.1, -.059, -.059, .1), byrow = TRUE, nrow = 2))
norm2 <- data.frame(norm2)
names(norm2) <- c("x", "y")
norm2$y <- 35 * norm2$y ## get roughly into reported ranges of values
plot(y ~ x, data = norm2) ## not shown; similar to the cloud in the question's plot, no outliers
with(norm2, cor(y, x))
# [1] -0.5559027
with(norm2, cor(y, x^2))
# [1] -0.5400946
with(norm2, cor(y, log(x)))
# [1] -0.5426008
with(norm2, cor(y, sqrt(x)))
# [1] -0.5546897
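
To see the rank-order point directly: Spearman correlations computed on the same simulated data should agree exactly across these transformations, because they depend only on the ranks of x, which the transformations preserve (a minimal sketch continuing the norm2 example above):

with(norm2, cor(y, x, method = "spearman"))
with(norm2, cor(y, log(x), method = "spearman"))  ## identical: log preserves ranks of positive x
with(norm2, cor(y, sqrt(x), method = "spearman")) ## identical: sqrt preserves ranks
with(norm2, cor(y, x^2, method = "spearman"))     ## identical here, since all x > 0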

If you don't have a solid theoretical reason for a particular transformation, it's often good to let the data suggest the functional form of the relationship by modeling the continuous predictor variable with a regression spline.
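
For example, a minimal sketch with the base-R splines package, continuing the norm2 data above (the choice of df = 4 is illustrative, not a recommendation):

library(splines)
## Fit y as a natural cubic spline in x; df controls the flexibility
fit_spline <- lm(y ~ ns(x, df = 4), data = norm2)
fit_linear <- lm(y ~ x, data = norm2)
## The linear fit is nested in the spline fit, so an F-test asks
## whether the nonlinear terms improve on the straight line
anova(fit_linear, fit_spline)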

Also, see the discussion on: Is normality testing 'essentially useless'? I would worry more about the potential high-leverage point noted by @dipetkov, but you need to apply your knowledge of the subject matter.
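
A quick way to screen for high-leverage points is via hat values and the built-in lm diagnostic plots (a minimal sketch, assuming your model is a simple lm fit of y on x; shown here on norm2):

fit <- lm(y ~ x, data = norm2)
h <- hatvalues(fit)
## a common rule of thumb flags points with leverage above 2p/n
which(h > 2 * mean(h))
## residuals vs. leverage, with Cook's distance contours
plot(fit, which = 5)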
