Regression – Controlling for Non-Linear Variable in Non-Linear Modeling of Response

controlling-for-a-variable, machine-learning, modeling, nonlinear-regression, regression

I need to model a continuous response variable $y$ based on continuous features $x_1, …, x_n$ while controlling for another continuous feature $x_c$. The intent is to understand how much an increase of 1 in one of the features (e.g. $x_1$) would on average impact $y$ while controlling for $x_c$.

If the relationships were linear, then including $x_c$ in an ordinary linear regression would be enough to control for it, but that's not the case here. The data have the following patterns that make them much more difficult to model:

  • The relationship between $y$ and $x_c$ is highly non-linear
  • The relationship between $y$ and $x_1, …, x_n$ is highly non-linear
  • Some of the features in $x_1, …, x_n$ are highly correlated with $x_c$

How would I best model $y$ using features $x_1, …, x_n$ while controlling for $x_c$?

Best Answer

The intent is to understand how much an increase of 1 in one of the features (e.g. $x_1$) would on average impact $y$ while controlling for $x_c$. (Emphasis added.)

First, given nonlinear associations of predictors with outcome, there isn't a unique answer. You have to specify a particular value of $x_1$ from which to evaluate the change in $y$ or a range of $x_1$ values over which you would average. If the nonlinearities involve interactions with other predictors, you would need to specify the levels of the interacting predictors too. Keep that in mind.
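Concretely, if $E[y \mid x_1, x_c] = f(x_1, x_c)$, the average effect of a 1-unit increase in $x_1$ from a starting value $a$ is

$$\Delta(a) = f(a + 1,\, x_c) - f(a,\, x_c),$$

which varies with $a$ (and, if $x_1$ interacts with $x_c$, with $x_c$ as well) unless $f$ is linear in $x_1$.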

Second, nonlinear associations of predictors with outcome can often be analyzed empirically with a linear regression model if you have no theoretical model in mind. A common choice is a particular form of piecewise-polynomial approximation, restricted cubic regression splines. The regression is then still linear in the coefficients, so once the general form of the spline is specified (via methods in standard statistical software) linear regression fitting is all that is required.
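As a minimal sketch of that approach with Harrell's rms package (mentioned again below) — the simulated data, the choice of 4 knots, and holding $x_c$ at its median are all illustrative assumptions, not part of your question:

```r
library(rms)

## Hypothetical data: y nonlinear in x1 and xc, x1 correlated with xc
set.seed(1)
n  <- 500
xc <- runif(n, 0, 10)
x1 <- xc + rnorm(n)                     # deliberately correlated with xc
y  <- sin(x1) + log1p(xc) + rnorm(n, sd = 0.3)
df <- data.frame(y, x1, xc)

dd <- datadist(df)
options(datadist = "dd")

## Restricted cubic splines with 4 knots each: nonlinear in x1 and xc,
## yet linear in the coefficients, so ols() is ordinary least squares.
fit <- ols(y ~ rcs(x1, 4) + rcs(xc, 4), data = df)

## The "effect of a 1-unit increase in x1" depends on the starting value;
## unspecified predictors (here xc) are held at datadist reference values:
contrast(fit, list(x1 = 3), list(x1 = 2))   # effect of moving 2 -> 3
contrast(fit, list(x1 = 8), list(x1 = 7))   # effect of moving 7 -> 8
```

The two `contrast()` calls will generally return different estimates, which is exactly the non-uniqueness described above.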

Chapter 2 of Frank Harrell's course notes outlines that approach to modeling nonlinear relationships among variables (Section 2.4), including how to evaluate model fit and handle interactions among such predictors (Section 2.7). There are related approaches with penalized splines and generalized additive models, discussed in this thread.
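For the penalized-spline / generalized-additive-model route, a comparable sketch with the mgcv package, reusing the simulated `df` from above (the REML choice is just one common default, not the only option):

```r
library(mgcv)

## Penalized splines: smoothness is estimated by penalization (REML)
## rather than by fixing knot counts in advance.
gfit <- gam(y ~ s(x1) + s(xc), data = df, method = "REML")
summary(gfit)

## The same 1-unit question at a chosen starting point, xc at its median:
nd <- data.frame(x1 = c(2, 3), xc = median(df$xc))
diff(predict(gfit, newdata = nd))
```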

Finally, as the comments indicate a potential interest in "feature importance," see Section 5.4 of Harrell's notes. The anova() function in his rms package can provide a measure of predictor importance that combines all nonlinear and interaction terms for a predictor: the difference between the partial $\chi^2$ for the predictor and its degrees of freedom. He uses analysis of multiple bootstrap samples to illustrate how unreliable such a measure can be.
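Applied to the rms fit from the sketch above, that is a two-liner; if I read the rms defaults correctly, plot() displays the statistic-minus-d.f. measure just described:

```r
## Combined partial test for each predictor, pooling all of its
## spline (nonlinear) terms into a single line of the table:
an <- anova(fit)
print(an)

## Dot chart ranking predictors; the default measure is the test
## statistic minus its degrees of freedom, as described above:
plot(an)
```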
