# Regression – Importance of Transformations in Polynomial Regression Over Linear Models

data-transformation · linear-model · r · regression

When performing linear regression, why might you choose to transform one of your predictor variables over using polynomial regression? Essentially what are the advantages (if any) of performing a transformation over using polynomial regression?

As an example I'm using the Auto dataset provided by ISLR2 package:

require(ISLR2)
require(ggplot2)

df <- Auto
df$horsepower <- log(Auto$horsepower, 2)

model.tran <- lm(mpg ~ horsepower, df)
model.poly <- lm(mpg ~ poly(horsepower, 2), Auto)

summary(model.tran)
summary(model.poly)

ggplot(Auto, aes(horsepower, mpg)) +
  geom_point() +
  geom_line(aes(y = predict(model.tran, df)), col = "blue") +
  ggtitle("Log Transformed")

ggplot(Auto, aes(horsepower, mpg)) +
  geom_point() +
  geom_line(aes(y = predict(model.poly, Auto)), col = "blue") +
  ggtitle("Polynomial, degree 2")



output:

Call:
lm(formula = mpg ~ horsepower, data = df)

Residuals:
Min       1Q   Median       3Q      Max
-14.2299  -2.7818  -0.2322   2.6661  15.4695

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 108.6997     3.0496   35.64   <2e-16 ***
horsepower  -12.8802     0.4595  -28.03   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.501 on 390 degrees of freedom
Multiple R-squared:  0.6683,    Adjusted R-squared:  0.6675
F-statistic: 785.9 on 1 and 390 DF,  p-value: < 2.2e-16

Call:
lm(formula = mpg ~ poly(horsepower, 2), data = Auto)

Residuals:
Min       1Q   Median       3Q      Max
-14.7135  -2.5943  -0.0859   2.2868  15.8961

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)            23.4459     0.2209  106.13   <2e-16 ***
poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared:  0.6876,    Adjusted R-squared:  0.686
F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16



As can be seen, the R-squared value is higher and the residual standard error is lower for the polynomial model, and the residuals-vs-fitted plots indicate that the residual variance is slightly more constant in the polynomial regression.
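The residual comparison can be reproduced with base R's built-in diagnostic plots; a minimal sketch that refits the two models from the question:

```r
library(ISLR2)

df <- Auto
df$horsepower <- log(Auto$horsepower, 2)

model.tran <- lm(mpg ~ horsepower, df)
model.poly <- lm(mpg ~ poly(horsepower, 2), Auto)

# which = 1 selects the residuals-vs-fitted panel for each model
par(mfrow = c(1, 2))
plot(model.tran, which = 1)
plot(model.poly, which = 1)
```

`sigma()` extracts the residual standard errors quoted in the summaries (4.501 vs 4.374) if you prefer a numeric comparison to an eyeball one.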

This answer https://stats.stackexchange.com/a/287475/353359 mentions implications for the model outside of the x-range and variation from the model. Is this simply a risk of overfitting?
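One way to see the out-of-range behaviour the linked answer alludes to is to predict at horsepower values well beyond the data (a sketch reusing the two models from the question; the specific horsepower values are arbitrary):

```r
library(ISLR2)

df <- Auto
df$horsepower <- log(Auto$horsepower, 2)

model.tran <- lm(mpg ~ horsepower, df)
model.poly <- lm(mpg ~ poly(horsepower, 2), Auto)

# Horsepower values beyond the observed range (max in Auto is 230)
hp <- c(250, 300, 400)

# The quadratic has a positive coefficient on the squared term, so its
# predictions eventually turn upward: more power, better mileage
pred.poly <- predict(model.poly, data.frame(horsepower = hp))

# The log model needs the same transformation applied to new data;
# its predictions keep falling, eventually below zero
pred.tran <- predict(model.tran, data.frame(horsepower = log(hp, 2)))
```

Neither extrapolation is physically sensible, which is a different failure mode from overfitting: the fitted curve can be perfectly smooth in-sample and still encode an implausible functional form outside it.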

While I have included an example, my question is more general: when would you want to use a transformation if polynomial regression seems like a catch-all for non-linearity?

In theory, the implicit assumption in regression is that you know the shape of the function---linear, quadratic, logarithmic etc.---and just need to find its parameters by fitting it to the data. In practice, of course, this is seldom the case. You can fit an infinite number of functions and obtain marvellous results on your training data, but such models will likely perform poorly in the real world.

So, when choosing the model, nothing beats domain knowledge. In case of your dataset, an automotive engineer or a physicist might be in the position to suggest a realistic model. But, even with a cursory knowledge of physics we might be able to exclude many candidates. For example:

• Can fuel efficiency of a car ever be negative?
• Is it plausible for the efficiency to rise with the engine power (hat tip to @Henry)?
• Can it rise infinitely with any change of power?

etc.

The first two points obviously speak against a polynomial model, including a linear one ("linear" referring to the predictors). The third speaks against $1/x$, which might otherwise seem plausible.

Among the infinite number of the functions passing the above plausibility check, the exponential decrease is probably the simplest model you can use.
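For instance, an exponential decrease $\text{mpg} = a\,e^{-b\cdot\text{hp}}$ can be fitted with an ordinary linear model after log-transforming the response, since $\log(\text{mpg}) = \log a - b\cdot\text{hp}$ is linear in horsepower (a sketch; this parametrisation is one choice among the plausible ones, not something the data dictates):

```r
library(ISLR2)

# Exponential decay is linear on the log scale of the response
model.exp <- lm(log(mpg) ~ horsepower, Auto)

# Back-transform the predictions to the mpg scale
hp <- seq(50, 400, by = 50)
pred <- exp(predict(model.exp, data.frame(horsepower = hp)))

# By construction the predictions stay positive everywhere and decrease
# monotonically with power, passing all three plausibility checks above
```

Note that naively back-transforming the conditional mean introduces a small retransformation bias; for model comparison against the fits above that subtlety can usually be set aside.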