Solved – Multiple regression – how to deal with mixed linear and non-linear variables

data-transformation, multicollinearity, regression

Say I have a bunch of explanatory variables to predict a continuous dependent variable.
Below, a simple toy example:

[figure: scatterplots of y against the explanatory variables x1–x4]

I think it would be easiest to apply a log-log transform and proceed with linear regression. The explanatory variables appear to be relatively highly correlated, so there may be substantial collinearity between them. I would address this later via, e.g.,

  • Lasso regression (or Ridge)
  • feature selection algorithms
  • Partial Least Squares
  • Decision trees and feature importance
  • Dimensionality reduction via principal component analysis
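The regularization route in the first bullet can be sketched in closed form. Below is a minimal NumPy example with synthetic, nearly collinear predictors (all names and numbers are illustrative, not from the question): ridge shrinks the coefficient vector relative to ordinary least squares, which stabilizes the fit under collinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# two nearly collinear predictors (synthetic)
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.5 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

ols = ridge(X, y, 0.0)    # lam=0 reduces to ordinary least squares
reg = ridge(X, y, 10.0)   # lam>0 shrinks the coefficients toward zero
print(ols, reg)
```

With collinear predictors the OLS coefficients can be large and of opposite sign; the ridge coefficients are shrunk and far more stable. (Lasso works similarly but has no closed form; scikit-learn's `Lasso`/`Ridge` would be the practical choice.)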

But back to the log-log transform; now, the data looks like this:

[figure: scatterplots of log(y) against log(x1)–log(x4)]

To me, it looks like x2, x3, and x4 are now better suited to linear regression. However, x1 no longer looks "very linear". How would I best deal with x1 before proceeding?
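As a sketch of why the log-log transform linearizes some of these relationships: if a variable follows a power law $y = a\,x^b$, then $\log y = \log a + b \log x$, so an ordinary linear fit on the log scale recovers the exponent. The data and parameters below are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)
# power-law relationship y = a * x^b with multiplicative noise
y = 2.0 * x ** 1.5 * np.exp(0.1 * rng.normal(size=500))

log_x, log_y = np.log(x), np.log(y)
# slope of the log-log fit estimates b; intercept estimates log(a)
b, log_a = np.polyfit(log_x, log_y, 1)
print(b, log_a)
```

A predictor like x1 that still looks curved after this transform simply doesn't follow a power law in x, so the log-log fit won't straighten it.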

Best Answer

I'm not sure what your plots show, since the axes are not labeled. However, if your goal is simply to predict $y$ and get the best predictions possible, and there is no need to interpret the coefficients, why not simply fit the model with all your variables and then validate it on a holdout sample? Select the model with the smallest prediction error. Don't worry so much about model assumptions if your only goal is prediction. It would be good to get a better idea of exactly what you want to do with your model, however, before a "correct" answer can be given to your question.
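The holdout procedure suggested above can be sketched as follows. The data-generating process and candidate models here are hypothetical stand-ins, not taken from the question: fit each candidate on a training split, score it on the held-out rows, and keep whichever has the smallest holdout error.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(0.5, 5.0, size=n)
y = 2.0 + 3.0 * np.log(x) + 0.3 * rng.normal(size=n)

# random holdout split (75% train / 25% test)
idx = rng.permutation(n)
train, test = idx[:300], idx[300:]

def fit_and_score(design):
    """Least-squares fit on the training rows, MSE on the holdout rows."""
    beta, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
    resid = y[test] - design[test] @ beta
    return np.mean(resid ** 2)

ones = np.ones(n)
candidates = {
    "linear":    np.column_stack([ones, x]),
    "log":       np.column_stack([ones, np.log(x)]),
    "quadratic": np.column_stack([ones, x, x ** 2]),
}
scores = {name: fit_and_score(d) for name, d in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Here the correctly specified log model should beat the plainly misspecified linear one on holdout error; in practice, the same comparison would be run over whatever transformed and untransformed specifications are under consideration.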