Solved – Feature standardization for polynomial regression with categorical data

categorical-encoding, machine-learning, multiple-regression, regression, standardization

Assume I have a set of $p$ independent variables (n_features, here $p=3$), each with $n$ observations (n_samples) and no missing values, defining my design matrix $X$ as follows:

$X = \begin{bmatrix}
x_{11} & \dots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \dots & x_{np}
\end{bmatrix}$

For my dataset with p=3 features:

$X=\left[\vec{x_1},\ \vec{x_2},\ \vec{x_3}\right]$

The variables are of the following kinds:

  • $y$, the dependent variable: continuous numeric variable
  • $x_1$ and $x_2$: continuous numeric variables with different value ranges, requiring standardization/scaling because $\ell_1$/$\ell_2$ regularization is applied
  • $x_3$: categorical numeric variable with the 3 levels $\left[0,1,2\right]$, requiring dummy coding/one-hot encoding into $k-1=2$ binary dummy variables

I want to feed this dataset into a polynomial regression of second degree with interaction terms (also regularization is applied), meaning my linear model to fit is of the following form:

$y=c + c_1x_1 + c_2x_2 + c_3x_3 + c_4x_1x_2 + c_5x_1x_3 + c_6x_2x_3 + c_7x_1^2 + c_8x_2^2 + c_9x_3^2 + \vec{\epsilon}$

with the intercept $c$, the coefficients $c_1\dots c_9$ and the error $\vec{\epsilon}$.
A polynomial transformation of the design matrix yields the transformed design matrix $X^*$:
$X^*=\left[\vec{x_1^*},\ \vec{x_2^*},\ \vec{x_3^*},\ \vec{x_4^*},\ \vec{x_5^*},\ \vec{x_6^*},\ \vec{x_7^*},\ \vec{x_8^*},\ \vec{x_9^*}\right]$
with $\vec{x_1^*}=\vec{x_1},\quad \dots,\quad \vec{x_4^*}=\vec{x_1}\vec{x_2},\quad \vec{x_5^*}=\vec{x_1}\vec{x_3},\quad \dots,\quad \vec{x_9^*}=\vec{x_3}^2$ (products taken elementwise).


Problem description

We now have interaction terms between continuous and categorical variables, namely $c_5x_1x_3$ and $c_6x_2x_3$.
Dummy coding of the categorical variable has not yet been performed! (Doing it before the polynomial transformation would produce additional polynomial terms.)
Standardization of the continuous independent variables also still needs to be done!
With a model consisting only of continuous variables, I'd standardize after the polynomial transformation in most cases. In this case, with mixed types of independent variables, I'd standardize the continuous variables and dummy code the categorical variable before the polynomial transformation.

Questions

  1. Should I standardize and dummy code after polynomial transformation?
  2. If yes, how to deal with the interaction terms of categorical and continuous variables?
  3. If yes, how serious are the disadvantages introduced with standardizing/dummy coding before poly. transf.?
  4. In general: how do I avoid sign flips (seemingly "random" negative values) caused by subtracting the mean before forming interaction terms (e.g., $x_1x_2$, where both $x_1$ and $x_2$ were positive before standardization but $x_1$ becomes negative for some samples afterwards)? Should I just scale by the standard deviation $\sigma$ (and possibly min-max scale afterwards)?

Best Answer

When a LASSO model includes a categorical predictor with more than 2 levels, you usually want to ensure that all levels of the predictor are selected together, as in the group LASSO. When a LASSO model includes interaction terms, it's important to maintain the hierarchy of the interactions: if the LASSO selects an interaction term, it should also select the individual predictors contributing to that interaction. That's discussed briefly here, and with more rigor by Bien, Taylor and Tibshirani in "A lasso for hierarchical interactions," Annals of Statistics 41:1111–1141, 2013.

For your questions 1 and 3, Bien, Taylor and Tibshirani seem to deal directly with your question:

It is common with the lasso to standardize the predictors so that they are on the same scale. In this paper, we standardize X [matrix of individual predictors] so that its columns have mean 0 and standard deviation 1; we then form Z [matrix of interaction terms] from these standardized predictors and, finally, center the resulting columns of Z.

As the quadratic terms in your model are essentially self-interactions, it would seem that you would be advised to proceed similarly. That is, standardize the continuous predictors $x_1$ and $x_2$ (subtract the mean, divide by the standard deviation), form the polynomial and interaction terms from the standardized predictors, then only center the polynomial and interaction terms. (As I understand it, the centering of the interactions isn't necessary but does simplify interpretation of the coefficients.) The corresponding R hierNet package by Bien and Tibshirani provides those choices as defaults: center features, standardize main effects, and don't standardize interactions. The hierNet() function does allow for other choices, if you want to play with other possibilities.

With respect to question 2, as noted in a comment it's not clear whether, or how best, to standardize a categorical predictor, particularly one with more than 2 levels. Provided you handle it with the group LASSO and respect the hierarchy of interactions, however, there isn't any problem in "deal[ing] with the interaction terms of categorical and continuous variables." If you choose treatment coding of the categorical predictor, then the coefficients of the continuous predictors and their interactions with each other represent those values when the categorical predictor is at its reference level; the corresponding interaction terms with the other levels of the predictor are the differences of the coefficients from those reference-level values. I see nothing to be gained by incorporating powers of the dummy variables representing the categorical predictor, since a 0/1 dummy equals its own square.

With respect to question 4, the "alternating signs" in interaction values after centering are features, not bugs. See this page for example. Leave them alone.