Solved – Deciding between a linear regression model and a non-linear regression model

hypothesis-testing, nonlinear-regression, predictive-models, regression

How should one decide between using a linear regression model or non-linear regression model?

My goal is to predict Y.

In the case of a simple dataset with one predictor $x$ and response $y$, I can easily decide which regression model to use by examining a scatter plot.

In the multivariate case, with predictors $x_1, x_2, \dots, x_n$ and response $y$, how can I decide which regression model to use? That is, how do I decide between a simple linear model and non-linear models such as quadratic, cubic, etc.?

Is there any technique, statistical approach, or graphical plot that can help infer which regression model should be used?

Best Answer

This is a realm of statistics called model selection. A lot of research has been done in this area, and there is no definitive, easy answer.

Let's assume you have $X_1, X_2$, and $X_3$ and you want to know if you should include an $X_3^2$ term in the model. In a situation like this your more parsimonious model is nested in your more complex model. In other words, the variables $X_1, X_2$, and $X_3$ (parsimonious model) are a subset of the variables $X_1, X_2, X_3$, and $X_3^2$ (complex model). In model building you have (at least) one of the following two main goals:

  1. Explain the data: you are trying to understand how some set of variables affects your response variable, or you are interested in how $X_1$ affects $Y$ while controlling for the effects of $X_2, \dots, X_p$
  2. Predict $Y$: you want to accurately predict $Y$, without caring about what or how many variables are in your model

If your goal is number 1, then I recommend the Likelihood Ratio Test (LRT). The LRT is used when you have nested models and you want to know "are the data significantly more likely to have come from the complex model than from the parsimonious model?". This will give you insight into which model better explains the relationship in your data.
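As a minimal sketch of what that looks like in Python with statsmodels (the data frame `df`, its columns, and the simulated data are purely illustrative assumptions), you fit both nested models and compare twice the difference in log-likelihoods to a chi-squared distribution:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical data for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200),
                   "x2": rng.normal(size=200),
                   "x3": rng.normal(size=200)})
df["y"] = 1 + 2 * df.x1 - df.x2 + 0.5 * df.x3**2 + rng.normal(size=200)

# Parsimonious (nested) model and complex model
fit_small = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
fit_big = smf.ols("y ~ x1 + x2 + x3 + I(x3**2)", data=df).fit()

# LRT statistic: 2 * (log-likelihood of complex - log-likelihood of nested),
# compared to a chi-squared distribution with df = difference in parameters
lr_stat = 2 * (fit_big.llf - fit_small.llf)
df_diff = fit_big.df_model - fit_small.df_model
p_value = stats.chi2.sf(lr_stat, df_diff)
print(f"LR statistic = {lr_stat:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the extra $X_3^2$ term significantly improves the fit; otherwise the parsimonious model is preferred.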

If your goal is number 2, then I recommend some sort of cross-validation (CV) technique ($k$-fold CV, leave-one-out CV, or a training-test split), depending on the size of your data. In short, these methods fit a model on a subset of your data and evaluate its predictions on the remaining data. Pick the model that does the best job of predicting on the held-out data according to cross-validation.
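Here is a minimal sketch of that idea using scikit-learn's $k$-fold CV (the arrays `X` and `y`, the simulated data, and the candidate polynomial degrees are illustrative assumptions, not a prescription):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # predictors x1, x2, x3
y = 1 + 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2]**2 + rng.normal(size=200)

# Compare linear, quadratic, and cubic feature expansions by 10-fold CV error
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_squared_error")
    print(f"degree {degree}: mean CV MSE = {-scores.mean():.3f}")
```

Whichever candidate gives the lowest cross-validated error is the one you would carry forward for prediction, regardless of how many terms it contains.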