Solved – What are some of the most common misconceptions about linear regression

multiple regression, regression

I'm curious, for those of you who have extensive experience collaborating with other researchers, what are some of the most common misconceptions about linear regression that you encounter?

I think it can be a useful exercise to think about common misconceptions ahead of time in order to

  1. Anticipate people's mistakes and be able to successfully articulate why some misconception is incorrect

  2. Realize if I am harboring some misconceptions myself!

A couple of basic ones I can think of:

  - Independent/dependent variables must be normally distributed

  - Variables must be standardized for accurate interpretation

Any others?

All responses are welcome.

Best Answer

False premise: A $\boldsymbol{\hat{\beta} \approx 0}$ means that there is no strong relationship between the DV and the IV.
Non-linear functional relationships abound, and yet data generated by many such relationships will often yield a nearly zero estimated slope if one assumes the relationship must be linear, or even approximately linear.
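A minimal sketch of this effect (assuming numpy and statsmodels, which are not part of the original post): a purely quadratic relationship over a range symmetric about zero yields an OLS slope and $R^2$ both near zero, even though the DV is an almost deterministic function of the IV.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)          # IV symmetric about zero
y = x**2 + rng.normal(0, 0.5, 500)   # strong quadratic signal plus noise

# Fit a straight line anyway (misspecified)
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)    # fitted slope is near 0 despite the strong relationship
print(fit.rsquared)  # R^2 is also near 0
```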

Relatedly, in another false premise researchers often assume—possibly because many introductory regression textbooks teach it—that one "tests for non-linearity" by building a series of regressions of the DV onto polynomial expansions of the IV (e.g., $Y \sim \beta_{0} + \beta_{X}X + \varepsilon$, followed by $Y \sim \beta_{0} + \beta_{X}X + \beta_{X^{2}}X^{2} + \varepsilon$, followed by $Y \sim \beta_{0} + \beta_{X}X + \beta_{X^{2}}X^{2} + \beta_{X^{3}}X^{3} + \varepsilon$, etc.). Just as a straight line cannot well represent a non-linear functional relationship between DV and IV, a parabola cannot well represent the literally infinite variety of non-linear relationships (e.g., sinusoids, cycloids, step functions, saturation effects, s-curves, etc. ad infinitum). One may instead take a regression approach that does not assume any particular functional form (e.g., running line smoothers, GAMs, etc.).
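As a hedged illustration of one form-agnostic option (assuming statsmodels' LOWESS smoother; the sinusoidal data are invented for the example): the smoother tracks a trend that no finite polynomial would fit globally, without any functional form being specified in advance.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 4 * np.pi, 400))
y = np.sin(x) + rng.normal(0, 0.3, 400)  # no polynomial captures a sinusoid globally

# LOWESS returns an array with sorted x in column 0 and fitted values in column 1
smoothed = lowess(y, x, frac=0.15)
# smoothed[:, 1] now tracks sin(x) closely, with no parametric form assumed
```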

A third false premise is that increasing the number of estimated parameters necessarily results in a loss of statistical power. This need not hold when the true relationship is non-linear and requires multiple parameters to estimate (e.g., a "broken stick" function requires not only a straight line's intercept and slope terms, but also estimates of the point at which the slope changes and of how much it changes): the residuals of a misspecified model (e.g., a straight line) may grow quite large relative to those of a properly specified functional relation, resulting in a lower rejection probability and wider confidence and prediction intervals (in addition to biased estimates).
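A small simulation sketch of this point (numpy/statsmodels assumed; the breakpoint is treated as known at $x = 5$ to keep the example short, whereas in practice it too would be estimated): the three-parameter hinge model beats the two-parameter straight line on residual variance, despite the extra parameter.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
# True "broken stick": slope 0.5 below x = 5, slope 2.5 above
y = 1.0 + 0.5 * x + 2.0 * np.maximum(x - 5, 0) + rng.normal(0, 1, 300)

line = sm.OLS(y, sm.add_constant(x)).fit()  # misspecified straight line

X_hinge = sm.add_constant(np.column_stack([x, np.maximum(x - 5, 0)]))
stick = sm.OLS(y, X_hinge).fit()            # correctly specified broken stick

print(line.mse_resid, stick.mse_resid)  # misspecified residual variance is much larger
print(line.conf_int())                  # straight-line intervals are wide and biased
print(stick.conf_int())                 # hinge-model intervals are tighter
```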
