Solved – Should I include non-linear features in the linear regression model

feature-selection, feature-engineering, multiple-regression

I'm building my first linear regression model with multiple features (predicting house prices in a specific city). While reading up on ways to improve the model, I saw people recommending that you plot the relationship between the target variable and each feature. I then realized that one of my features, the construction year of the house, is quite "jumpy", which probably distorts its coefficient.

My question: How does one handle features like this one? Drop them? Transform them somehow? Turn them into categorical variables?

Chart below. The y-axis is the mean house price (in Swedish kronor) per construction year.

Mean price by construction year

Edit: Added plot of residuals below.
Residuals plot

Edit 2: Added residual histogram below.
Histogram of residuals

Best Answer

Your residual plot looks close enough to normal that fitting a linear regression with an ordinary least squares (OLS) loss is reasonable.

The "linear" in linear regression refers to the form of the model, which is linear in its parameters:

$$ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i $$

The model is linear in each coefficient $\beta_j$; it does not require a linear relationship between the dependent variable and the raw independent variables. You can therefore regress on transformed features (for example $X^2$ or $\log X$) and still have a linear model.
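
As a minimal sketch (using scikit-learn and made-up year/price numbers, not your data), this is what adding a squared construction-year term looks like while the fit itself remains ordinary linear regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: construction year and mean price (SEK), for illustration only.
year = np.array([1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020]).reshape(-1, 1)
price = np.array([2.1e6, 2.3e6, 2.0e6, 2.4e6, 2.8e6, 3.1e6, 3.6e6, 4.0e6])

# Add a squared term: the model stays linear in its coefficients,
# even though it is non-linear in the original feature.
poly = PolynomialFeatures(degree=2, include_bias=False)
X = poly.fit_transform(year)          # columns: year, year^2

model = LinearRegression().fit(X, price)
print(model.intercept_, model.coef_)  # one coefficient per (transformed) column
```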

If you are looking for a regression model that fits a genuinely non-linear function, check out support vector machines (SVMs) with a polynomial kernel.
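
A minimal sketch of that idea, again with hypothetical year/price numbers rather than your data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical construction-year / mean-price data (SEK), for illustration only.
year = np.array([1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020]).reshape(-1, 1)
price = np.array([2.1e6, 2.3e6, 2.0e6, 2.4e6, 2.8e6, 3.1e6, 3.6e6, 4.0e6])

# SVR with a polynomial kernel fits a non-linear curve in the original feature;
# standardizing the input helps the SVM optimizer.
svr = make_pipeline(StandardScaler(), SVR(kernel="poly", degree=3, C=1.0))
svr.fit(year, price)
print(svr.predict(np.array([[1995]])))  # predicted mean price for a 1995 house
```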
