Regression – Why Higher R-squared After Log Transformation

data transformation, logarithm, regression

Lately I lost access to SPSS, and instead of using Python or R, I have been performing my analyses with a free program called Jamovi.

The problem is that this program doesn't offer the non-linear regression models I used to rely on in SPSS, so I now fit only linear regression models.

Because of this, I have gotten into the habit of log-transforming the data before fitting the model. I noticed that most of the time, after log-transforming a predictor in the linear regression, the R-squared increases.

I know that a log transform can reduce the skewness of the data, but I wonder why it always seems to improve the linear regression model (using only the log-transformed variable as a predictor, not the original variable).

Is this a true improvement or is it some kind of statistical artifact? Why does this happen? Should I use it every time?

Best Answer

There's no reason why this should always improve your $R^2$. Most likely it happens because you are forcing a linear fit where the relationship isn't linear, and the log transformation 'linearizes' the relationship between the predictor and the response somewhat.

Here's a quick counterexample where the data-generating mechanism is in fact linear:

set.seed(1)
x <- runif(1e2)                    # 100 uniform predictor values
y <- x + rnorm(1e2, 0, 0.1)        # truly linear relationship plus noise

summary(fit <- lm(y ~ x))$r.squared           # 0.897
summary(lfit <- lm(y ~ log(x)))$r.squared     # 0.736

The first 100,000 seeds all show a better $R^2$ for the untransformed predictor.
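Conversely, here is a quick sketch (my own, under the assumption that the true relationship is logarithmic, which may or may not match your data) showing why the transform can help when the raw relationship is not linear:

set.seed(1)
x <- runif(1e2)
y <- log(x) + rnorm(1e2, 0, 0.1)              # assumed logarithmic relationship plus noise

summary(lm(y ~ x))$r.squared                  # noticeably lower: straight line is a poor fit
summary(lm(y ~ log(x)))$r.squared             # close to 1, since the model matches the truth

If your data look like this second case, the higher $R^2$ after transformation reflects a genuinely better linear fit for that sample, not an artifact; the transform has simply linearized the predictor-response relationship, as described above.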