B-splines vs. high-order polynomials in regression

Tags: multiple regression, polynomial, regression, regularization, splines

I do not have a specific example or task in mind. I'm new to B-splines and want a better understanding of how they are used in a regression context.

Let's assume that we want to assess the relationship between the response variable $y$ and some predictors $x_1, x_2, \ldots, x_p$. The predictors include some numerical variables as well as some categorical ones.

Let's say that after fitting a regression model, one of the numerical variables, e.g. $x_1$, is significant. A logical next step is to assess whether higher-order polynomial terms, e.g. $x_1^2$ and $x_1^3$, are required to adequately explain the relationship without overfitting.

My questions are:

  1. At what point do you choose between B-splines and a simple higher-order polynomial? For example, in R:

    y ~ poly(x1, 3) + x2 + x3

    vs

    y ~ bs(x1, 3) + x2 + x3
    
  2. How can you use plots to inform your choice between those two, and what happens if it's not really clear from the plots (e.g. due to a massive number of data points)?

  3. How would you assess the two-way interaction terms between $x_2$ and, let's say, $x_3$?

  4. How do the above change for different types of models?

  5. Would you consider never using high-order polynomials, always fitting B-splines instead, and penalising the high flexibility?

Best Answer

I would usually consider only splines rather than polynomials. Polynomials cannot model thresholds and are often undesirably global, i.e., observations in one range of the predictor have a strong influence on what the model does in a different range (Magee, 1998, The American Statistician; Frank Harrell's Regression Modeling Strategies). And of course restricted splines, which are linear outside the extremal knots, are better for extrapolation, or even for interpolation at extreme values of the predictors.
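To make the "global vs. local" point concrete, here is a minimal R sketch on simulated data (the sine-shaped signal, the knot positions and the helper `move_left` are assumptions chosen only for illustration, not part of the question). It checks four things: that `bs(x, 3)` as written in the question sets `df = 3` and therefore has no interior knots, so it reproduces the cubic polynomial fit; that B-spline basis columns are exactly zero over most of the range while polynomial columns are not; that perturbing the response on the far right moves a polynomial fit on the far left much more than a spline fit; and that a natural spline from `ns()` is a straight line beyond the boundary knots.

```r
library(splines)  # ships with base R

set.seed(1)
x <- seq(0, 10, length.out = 201)
y <- sin(x) + rnorm(length(x), sd = 0.2)  # simulated data, purely illustrative

# 1. bs(x, 3) means df = 3: with the default degree 3 there are no interior
#    knots, so the fit coincides with an ordinary cubic polynomial.
same_fit <- all.equal(fitted(lm(y ~ bs(x, 3))), fitted(lm(y ~ poly(x, 3))))

# 2. Local support: a cubic B-spline basis with interior knots at 1, ..., 9
#    (an arbitrary choice) versus an orthogonal cubic polynomial basis.
B <- bs(x, knots = 1:9)
P <- poly(x, 3)
zero_share_B <- colMeans(unclass(B) == 0)  # share of x values where each column is 0
zero_share_P <- colMeans(unclass(P) == 0)

# 3. Global influence: shift the response only where x > 9 and measure how far
#    the fitted curve moves where x < 1.
y_shift <- y
y_shift[x > 9] <- y_shift[x > 9] + 5
move_left <- function(basis) {
  d <- fitted(lm(y_shift ~ basis)) - fitted(lm(y ~ basis))
  max(abs(d[x < 1]))
}
move_P <- move_left(P)  # polynomial fit
move_B <- move_left(B)  # B-spline fit

# 4. Linear tails: predictions from a natural spline at and beyond the upper
#    boundary knot lie on a straight line, so second differences vanish.
fit_ns <- lm(y ~ ns(x, df = 4))
p_out  <- predict(fit_ns, newdata = data.frame(x = 10:14))
tail_curvature <- diff(p_out, differences = 2)
```

Printing `same_fit`, `zero_share_B`, `zero_share_P`, `move_P`, `move_B` and `tail_curvature` shows the contrast. The practical takeaway for the question's formula is that a spline only differs from `poly(x1, 3)` once interior knots are added, e.g. via `knots =` or a larger `df`.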

One case where you may want to consider polynomials is when it is important to explain your model to a nontechnical audience. People understand polynomials better than splines. (Edit: Matthew Drury points out that people may only think they understand polynomials better than splines. I won't take sides on this question.)

Plots are often not very useful for deciding between different ways of dealing with nonlinearity. It is better to compare the candidate models by cross-validation, which will also help you assess interactions or find a good penalization.
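As a rough sketch of what such a comparison could look like in R, the following runs 10-fold cross-validation over a few candidate formulas with `lm()`. Everything here is an assumption for illustration: the simulated data, the variable names, the candidate list, the degrees of freedom and the helper `cv_rmse`. The boundary knots are fixed at the known range of `x1` so that held-out points never fall outside them.

```r
library(splines)  # ships with base R

set.seed(42)
n   <- 300
dat <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
dat$y <- sin(dat$x1) + 0.5 * dat$x2 + rnorm(n, sd = 0.3)  # simulated, illustrative only

# Candidate ways of handling the nonlinearity in x1 (hypothetical choices),
# including one model with an x1-by-x2 interaction.
candidates <- list(
  poly3   = y ~ poly(x1, 3) + x2,
  bs6     = y ~ bs(x1, df = 6, Boundary.knots = c(0, 10)) + x2,
  ns4     = y ~ ns(x1, df = 4, Boundary.knots = c(0, 10)) + x2,
  ns4_int = y ~ ns(x1, df = 4, Boundary.knots = c(0, 10)) * x2
)

# Assign every row to one of k folds at random.
k     <- 10
folds <- sample(rep(seq_len(k), length.out = n))

# Out-of-fold root mean squared error for one formula.
cv_rmse <- function(form) {
  sq_err <- numeric(n)
  for (i in seq_len(k)) {
    held_out <- folds == i
    fit <- lm(form, data = dat[!held_out, ])
    pred <- predict(fit, newdata = dat[held_out, ])
    sq_err[held_out] <- (dat$y[held_out] - pred)^2
  }
  sqrt(mean(sq_err))
}

cv_results <- sapply(candidates, cv_rmse)
cv_results
```

The candidate with the lowest out-of-fold error is the one this criterion favours. The same loop can be reused to tune the number of knots or a penalty, and for other model classes by swapping `lm()` for the relevant fitting function.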

Finally, my answer doesn't change with the kind of model, because the points above are valid for any statistical or ML model.