I would usually consider only splines, not polynomials. Polynomials cannot model thresholds and are often undesirably global, i.e., observations in one range of the predictor have a strong influence on what the model does in a different range (Magee, 1998, The American Statistician; Frank Harrell's Regression Modeling Strategies). And of course restricted splines, which are linear outside the extremal knots, are better for extrapolation, or even for interpolation at extreme values of the predictors.
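As a concrete sketch of the contrast (the data here are entirely made up): a natural/restricted cubic spline via `splines::ns()`, which is constrained to be linear beyond its boundary knots, next to a global polynomial fit.

```r
library(splines)  # ships with R; ns() builds a natural (restricted) cubic spline basis

set.seed(1)       # made-up illustrative data
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- sin(dat$x) + rnorm(200, sd = 0.3)

# ns() is linear beyond the boundary knots, so extrapolation is tamer
fit_spline <- lm(y ~ ns(x, df = 4), data = dat)

# a global quartic polynomial, for contrast: every observation
# influences the fit everywhere
fit_poly <- lm(y ~ poly(x, 4), data = dat)
```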
One case where you may want to consider polynomials is when it is important to explain your model to a nontechnical audience. People understand polynomials better than splines. (Edit: Matthew Drury points out that people may only think they understand polynomials better than splines. I won't take sides on this question.)
Plots are often not very useful in deciding between different ways of dealing with nonlinearity. Better to do cross-validation. This will also help you assess interactions, or find a good penalization.
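For instance, a plain k-fold cross-validation loop comparing candidate specifications might look like this (the data, fold count, and function name `cv_mse` are all made up for illustration):

```r
library(splines)

set.seed(2)  # made-up data, as in the sketch above
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- sin(dat$x) + rnorm(200, sd = 0.3)

folds <- sample(rep(1:10, length.out = nrow(dat)))  # 10-fold assignment

# mean held-out squared error for a given model formula
cv_mse <- function(form) {
  mean(sapply(1:10, function(k) {
    fit <- lm(form, data = dat[folds != k, ])
    mean((dat$y[folds == k] - predict(fit, dat[folds == k, ]))^2)
  }))
}

cv_mse(y ~ ns(x, df = 4))  # restricted cubic spline
cv_mse(y ~ poly(x, 4))     # global polynomial
```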
Finally, my answer doesn't change with the kind of model, because the points above are valid for any statistical or ML model.
When a LASSO model includes a categorical predictor with more than 2 levels, you usually want to ensure that all levels of the predictor are selected together, as in the group LASSO. When a LASSO model includes interaction terms, it's important to maintain the hierarchy of the interactions: if the LASSO selects an interaction term, it should also select the main-effect terms of the individual predictors contributing to that interaction. That's discussed briefly here and with more rigor by Bien, Taylor and Tibshirani in "A lasso for hierarchical interactions", Annals of Statistics 41:1111–1141, 2013.
For your questions 1 and 3, Bien, Taylor and Tibshirani seem to deal directly with your question:
It is common with the lasso to standardize the predictors so that they are on the same scale. In this paper, we standardize X [matrix of individual predictors] so that its columns have mean 0 and standard deviation 1; we then form Z [matrix of interaction terms] from these standardized predictors and, finally, center the resulting columns of Z.
As the quadratic terms in your model are essentially self-interactions, it would seem advisable to proceed similarly. That is, standardize the continuous predictors $x_1$ and $x_2$ (subtract the mean, divide by the standard deviation), form the polynomial and interaction terms from the standardized predictors, then only center the polynomial and interaction terms. (As I understand it, centering the interactions isn't necessary but does simplify interpretation of the coefficients.) The corresponding R package hierNet by Bien and Tibshirani provides those choices as defaults: center features, standardize main effects, and don't standardize interactions. The hierNet() function does allow for other choices, if you want to play with other possibilities.
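A sketch of how one might use the package on a small made-up dataset (the data and variable names are illustrative; check the package documentation for the full argument list):

```r
library(hierNet)  # Bien & Tibshirani's hierarchical-interaction lasso

set.seed(3)       # made-up data for illustration
n <- 100
X <- matrix(rnorm(n * 3), n, 3)
y <- X[, 1] + 2 * X[, 2] + 1.5 * X[, 1] * X[, 2] + rnorm(n)

# fit a path of strong-hierarchy models; the defaults center the
# features, standardize main effects, and leave interactions unstandardized
path <- hierNet.path(X, y, strong = TRUE)

# cross-validate over the path and refit at the selected lambda
cv   <- hierNet.cv(path, X, y)
best <- hierNet(X, y, lam = cv$lamhat, strong = TRUE)
```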
With respect to question 2, as noted in a comment it's not clear whether or how best to standardize a categorical predictor, particularly with more than 2 levels. Provided you handle it with group LASSO and respect the hierarchy of interactions, however, there isn't any problem in "deal[ing] with the interaction terms of categorical and continuous variables." If you choose treatment coding of the categorical predictor then the coefficients of the continuous predictors and their interactions with each other represent those values when the categorical predictor is at its reference level. The corresponding interaction terms with the other levels of the predictor are the differences of the coefficients from those values for the reference level. I see nothing to be gained by incorporating powers of the dummy variables representing the categorical predictor.
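To see what treatment coding does to the interaction columns, a small illustration (the factor levels and data are made up):

```r
# a 3-level factor crossed with a continuous predictor
d <- data.frame(g  = factor(c("a", "a", "b", "b", "c", "c")),
                x1 = c(1, 2, 3, 4, 5, 6))

# with treatment coding, level "a" is the reference level;
# the gb:x1 and gc:x1 columns encode differences from the
# reference-level slope of x1
model.matrix(~ g * x1, data = d)
```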
With respect to question 4, the "alternating signs" in interaction values after centering are features, not bugs. See this page for example. Leave them alone.
In addition to @mkt's excellent answer, I thought I would provide a specific example for you to see so that you can develop some intuition.
Generate Data for Example
For this example, I generated some data using R as follows:
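The original code is not shown in this excerpt; here is a minimal sketch consistent with the model and coefficients stated below (the seed, sample size, predictor distributions, and the data-frame name `d` are assumptions):

```r
set.seed(1)   # assumed seed, for reproducibility
n  <- 100     # assumed sample size
x1 <- rnorm(n)   # assumed predictor distributions
x2 <- rnorm(n)

# y = 1 + 10*x1 + 0.4*x2 + 0.8*x2^2 + N(0, 1) noise, per the stated model
y <- 1 + 10 * x1 + 0.4 * x2 + 0.8 * x2^2 + rnorm(n, sd = 1)

d <- data.frame(y, x1, x2)
```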
As you can see from the above, the data come from the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \epsilon$, where $\epsilon$ is a normally distributed random error term with mean $0$ and unknown variance $\sigma^2$. Furthermore, $\beta_0 = 1$, $\beta_1 = 10$, $\beta_2 = 0.4$, and $\beta_3 = 0.8$, while $\sigma = 1$.
Visualize the Generated Data via Coplots
Given the simulated data on the outcome variable y and the predictor variables x1 and x2, we can visualize these data using coplots:
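A sketch of the coplot calls (assuming the simulated data sit in a data frame `d`); `number = 4` produces the four overlapping conditioning ranges described below, and `panel.smooth` adds the smooth fits:

```r
# y vs x1, conditioning on four overlapping ranges of x2
coplot(y ~ x1 | x2, data = d, number = 4, panel = panel.smooth)

# y vs x2, conditioning on four overlapping ranges of x1
coplot(y ~ x2 | x1, data = d, number = 4, panel = panel.smooth)
```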
The resulting coplots are shown below.
The first coplot shows scatterplots of y versus x1 when x2 belongs to four different ranges of observed values (which are overlapping) and enhances each of these scatterplots with a smooth, possibly non-linear fit whose shape is estimated from the data.
The second coplot shows scatterplots of y versus x2 when x1 belongs to four different ranges of observed values (which are overlapping) and enhances each of these scatterplots with a smooth fit.
The first coplot suggests that it is reasonable to assume that x1 has a linear effect on y when controlling for x2 and that this effect does not depend on x2.
The second coplot suggests that it is reasonable to assume that x2 has a quadratic effect on y when controlling for x1 and that this effect does not depend on x1.
Fit a Correctly Specified Model
The coplots suggest fitting the following model to the data, which allows for a linear effect of x1 and a quadratic effect of x2:
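That model can be fit with lm(), for example as follows (assuming the data frame `d` from the data-generation step; the name `m` matches the correctly specified model referred to later):

```r
# linear effect of x1, quadratic effect of x2
m <- lm(y ~ x1 + x2 + I(x2^2), data = d)
summary(m)
```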
Construct Component Plus Residual Plots for the Correctly Specified Model
Once the correctly specified model is fitted to the data, we can examine component plus residual plots for each predictor included in the model:
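One way to do this is with crPlots() from the car package, which produces one component-plus-residual panel per term in the model (assuming the fitted model is named `m`):

```r
library(car)  # provides crPlots()
crPlots(m)    # one panel each for x1, x2, and I(x2^2)
```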
These component plus residual plots are shown below and suggest that the model was correctly specified, since they display no evidence of remaining nonlinearity. Indeed, in each of these plots, there is no obvious discrepancy between the dotted blue line, representing a linear effect of the corresponding predictor, and the solid magenta line, representing a smooth (possibly non-linear) estimate of that predictor's effect in the model.
Fit an Incorrectly Specified Model
Let's play devil's advocate and say that our lm() model was in fact incorrectly specified (i.e., misspecified), in the sense that it omitted the quadratic term I(x2^2):
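A sketch of that misspecified fit (assuming the data frame `d`; the name `m.mis` matches the misspecified model referred to below):

```r
# omits the quadratic term I(x2^2), so the effect of x2 is misspecified
m.mis <- lm(y ~ x1 + x2, data = d)
```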
Construct Component Plus Residual Plots for the Incorrectly Specified Model
If we were to construct component plus residual plots for the misspecified model, we would immediately see a suggestion of non-linearity of the effect of x2 in the misspecified model:
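The same crPlots() call as before, now applied to the misspecified model:

```r
library(car)
crPlots(m.mis)  # the x2 panel should show clear curvature
```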
In other words, as seen below, the misspecified model failed to capture the quadratic effect of x2 and this effect shows up in the component plus residual plot corresponding to the predictor x2 in the misspecified model.
The misspecification of the effect of x2 in the model m.mis would also be apparent when examining plots of the residuals associated with this model against each of the predictors x1 and x2:
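These residual plots can be produced with base graphics, for example (assuming the data frame `d` and the model `m.mis`):

```r
par(mfrow = c(1, 2))  # two panels side by side

plot(d$x1, resid(m.mis), xlab = "x1", ylab = "Residual")
abline(h = 0, lty = 2)

plot(d$x2, resid(m.mis), xlab = "x2", ylab = "Residual")
abline(h = 0, lty = 2)
```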
As seen below, the plot of residuals associated with m.mis versus x2 exhibits a clear quadratic pattern, suggesting that the model m.mis failed to capture this systematic pattern.
Augment the Incorrectly Specified Model
To correctly specify the model m.mis, we would need to augment it so that it also includes the term I(x2^2):
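One convenient way is update(), which refits the model with the extra term added to the formula (assuming the misspecified model is named `m.mis`):

```r
# add the quadratic term to the misspecified model's formula and refit
m <- update(m.mis, . ~ . + I(x2^2))
```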
Here are the plots of the residuals versus x1 and x2 for this correctly specified model:
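The same base-graphics sketch as before, now for the correctly specified model `m`:

```r
par(mfrow = c(1, 2))

plot(d$x1, resid(m), xlab = "x1", ylab = "Residual")
abline(h = 0, lty = 2)

plot(d$x2, resid(m), xlab = "x2", ylab = "Residual")
abline(h = 0, lty = 2)
```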
Notice that the quadratic pattern previously seen in the plot of residuals versus x2 for the misspecified model m.mis has now disappeared from the plot of residuals versus x2 for the correctly specified model m.
Note that the vertical axis of all the plots of residuals versus x1 and x2 shown here should be labelled "Residual". For some reason, RStudio cuts that label off.