Linear Regression – Understanding the Linearity Assumption and Its Importance

Tags: assumptions, linear, machine learning, regression

According to this website, the linearity assumption is met if the scatter plot follows a linear pattern (i.e. not a curvilinear pattern).

Here is an example where the assumption is not met.

[scatter plot showing a curvilinear pattern]

But as far as I know, the only real requirement is that the model must be linear in the unknown coefficients, which means the fitted curve can have a parabola shape and the model is still linear. Thus, we are not violating the linearity assumption.

Did I make an error of understanding?

Best Answer

You still need to have a function or functions of the original variable(s) that the response is linear in.

You're correct that linear regression is linear in the coefficients, but then it's equally linear in the things the coefficients are multiplied by. (Where here we're talking in the sense of a linear map, rather than "has a straight-line relationship", though the two are related concepts when you have a constant term included in the predictors.)

For multiple regression we write $E(Y|\mathbf{x})= X\beta$, where $X$ is the matrix of variables as actually supplied to the regression (and the constant). This is linear in $\beta$ but it's equally linear in the columns of $X$.
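A minimal numerical sketch of this matrix form (with made-up data; the variable names and the use of `numpy.linalg.lstsq` are my choices, not from the answer): the ordinary least squares fit operates on whatever columns $X$ contains, and with noiseless data it recovers $\beta$ exactly.

```python
import numpy as np

# Made-up design matrix X: a constant column plus two predictors.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
X = np.column_stack([np.ones(50), x1, x2])

# True coefficients and a noiseless response, so OLS recovers beta exactly.
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true

# Ordinary least squares: minimize ||y - X beta||^2 over beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [1.0, 2.0, -0.5]
```

Nothing in the solver cares where the columns of $X$ came from; it only sees the linear map $\beta \mapsto X\beta$.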

In the case of simple regression, if you can write an equation such as $Y = \beta_0+\beta_1 t(x) + \epsilon$, or $E(Y|x)=\beta_0+\beta_1 t(x)$, that's linear in the supplied variables $(1,x^*)$, where $x^*=t(x)$.

If you know a $t(x)$ to supply to the regression, that means you don't need to have a straight-line relationship between $y$ and $x$, but there's still a linear relationship.

A variety of approaches model nonlinear relationships with linear equations, including polynomials, various kinds of regression splines, trigonometric functions, and so forth, all of which retain this property of still being (multiple) linear regression models.
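To connect this back to the question's parabola example, here is a hedged sketch (my own toy data and coefficients, not from the answer) where the relationship between $y$ and $x$ is curved, yet the model is fit by the same linear least squares machinery because the design matrix simply gains an $x^2$ column:

```python
import numpy as np

# Curved relationship: E(Y|x) = 3 - 2x + 0.5x^2 (made-up coefficients).
x = np.linspace(-3, 3, 40)
y = 3 - 2 * x + 0.5 * x**2

# Design matrix with columns 1, x, x^2: the model is nonlinear in x
# but linear in the coefficients beta, so this is still linear regression.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # recovers [3.0, -2.0, 0.5]
```

The scatter plot of $y$ against $x$ would show a parabola, but the plot of $y$ against the fitted values (or against the transformed predictors) is a straight line, which is the sense in which the linearity assumption holds.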
