Ridge Regression – Standardization of Dummy Indicators in Lasso

categorical-data, lasso, normalization, predictive-models, standardization

Say I have a data set with 5000 rows and about 150 columns (5000 samples, 150 predictors/features), and I'm interested in applying a ridge or lasso regression (let us assume a logit link function, if it matters). Again, if it matters, we'll say that some features are highly correlated, e.g., that more than half of them are actually dummy variable indicators. Let us also say we are more concerned with prediction accuracy than model interpretability (if we wanted a clean interpretation, we would not use a penalized regression!).

Now, the standard recommendation for penalized linear models is to standardize the data (I believe Tibshirani recommends this always; by standardize, I mean center each column by its mean and scale by its standard deviation). However, this will destroy a lot of the structure inherent in the data set. I don't think it makes sense to do this here, since what were sparse binary indicators become dense columns with a non-zero value for every observation, so they are always "present" in the model unless the lasso zeroes their coefficients out.

Is there a better way to deal with this issue? One could normalize the rows instead (e.g., Normalizer() in sklearn), which would preserve that structure, but in practice this gives me vastly different results.

Should one standardize all inputs? Only the non-binary ones? And what about ordinal categories (e.g., a column with values 1, 2, 3, … measuring years of education)? Normalization or min/max scaling?

Clarification: By standardizing binary inputs, you remove the sparsity in the data set, as you must first subtract the mean. This is what I mean by "destroying the structure".
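
To make that concrete, here is a minimal sketch (the 1% prevalence and the use of sklearn's StandardScaler are just illustrative choices):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A rare dummy indicator: roughly 1% prevalence across 5000 samples (made up).
rng = np.random.default_rng(0)
x = (rng.random(5000) < 0.01).astype(float).reshape(-1, 1)

print(np.mean(x == 0))             # ~0.99 of the entries are exactly zero

z = StandardScaler().fit_transform(x)
print(np.mean(z == 0))             # 0.0 -- after centering, no entry is zero
print(np.unique(np.round(z, 2)))   # still two values, roughly -0.1 and +10
```

The column is still two-valued, but the zeros are gone, and with them any sparse representation.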

Best Answer

You have identified an important but perhaps under-appreciated issue: there is no single one-size-fits-all approach to normalizing categorical variables in penalized regression.

Normalization tries to ensure that penalization is applied fairly across all predictors, regardless of the scale of measurement. You don't want penalization of a predictor based on length to depend on whether you measured the length in millimeters or miles. So centering by the mean and scaling by the standard deviation before penalization can make sense for a continuous predictor.

But what does one mean by the "scale of measurement" of a categorical predictor? For a binary predictor having 50% prevalence, normalization turns original values of 0 and 1 into -1 and +1 respectively, for an overall difference of 2 units on the normalized scale. For a binary predictor having 1% prevalence, original values of 0 and 1 are transformed to approximately -0.1 and +9.9, for an overall difference of 10 units on the normalized scale.
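
In general, for a binary predictor with prevalence $p$, the mean is $p$ and the standard deviation is $\sqrt{p(1-p)}$, so standardization maps

$$0 \;\mapsto\; -\sqrt{\frac{p}{1-p}}, \qquad 1 \;\mapsto\; \sqrt{\frac{1-p}{p}},$$

for an overall difference of $1/\sqrt{p(1-p)}$ units: 2 when $p = 0.5$, about 10 when $p = 0.01$.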

Between binary predictors having these properties, normalization thus introduces a factor of 5 into their relative transformed scales, and thus in their sensitivities to penalization, versus the case in the original 0/1 coding. Is that what you want? And are normalized categorical predictors more "scale-free" so that the binary and continuous predictors are in some sense penalized fairly with respect to each other? You have to make that decision yourself, based on knowledge of the subject matter and your goals for prediction.

Harrell's Regression Modeling Strategies covers this in Section 9.10 on Penalized Maximum Likelihood Estimation. As he notes, there is a further problem with multi-category predictors, as the results of normalization can differ depending on the choice of reference level. In that case, he suggests penalizing the differences of all coefficients for the same categorical predictor from their mean, rather than penalizing each coefficient individually.
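
Written out, one version of that idea (including the reference level's implicit zero coefficient, so the penalty does not depend on which level is the reference) is a ridge-type term of the form

$$\lambda \sum_{j=0}^{c} \left(\beta_j - \bar{\beta}\right)^2, \qquad \bar{\beta} = \frac{1}{c+1}\sum_{j=0}^{c} \beta_j,$$

where $\beta_0 = 0$ stands for the reference level: changing the reference shifts every $\beta_j$ by the same constant, which leaves the deviations from the mean unchanged.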

You do have some flexibility in choosing how to penalize. Some standard software, like glmnet() in R, allows for differential penalization among predictors, which Harrell discusses as an alternative to pre-normalizing the predictor values themselves so that the net result is scale-free. But you still have to grapple with the issue of what you wish to consider as the "scale" of a categorical predictor.
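
For instance, glmnet() exposes this through its penalty.factor argument. scikit-learn has no direct equivalent, but the same effect can be obtained by rescaling columns: penalizing a coefficient w times as hard is the same as fitting an ordinary lasso on that column divided by w and then dividing the returned coefficient by w. A rough sketch (the helper name and the `is_dummy` mask are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_l1_logit(X, y, penalty_weights, C=1.0):
    """L1-penalized logistic regression in which predictor j is penalized
    penalty_weights[j] times as strongly as in a plain lasso.

    Penalizing w_j * |b_j| is equivalent to running the ordinary lasso on
    the rescaled column x_j / w_j and dividing the fitted coefficient by
    w_j afterward.  Weights must be strictly positive.
    """
    w = np.asarray(penalty_weights, dtype=float)
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X / w, y)
    beta = model.coef_.ravel() / w   # coefficients on the original column scale
    return beta, model.intercept_[0]

# Example: give continuous predictors a penalty weight equal to their SD
# (so they are effectively penalized on a standardized scale) while leaving
# dummy columns on the raw 0/1 scale.  X, y and the boolean column mask
# `is_dummy` are assumed to exist.
# w = np.where(is_dummy, 1.0, X.std(axis=0))
# beta, intercept = weighted_l1_logit(X, y, w)
```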

If you have no useful information from subject-matter knowledge about how best (if at all) to scale your categorical predictors, why not just compare different approaches to scaling them as you build the model? You should of course validate such an approach, for example by repeating the entire model-building process on multiple bootstrap resamples of the data and testing the model predictions on the original data. With your interest in making useful predictions, this provides a principled way to see what prediction method works best for you.
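
As a rough sketch of such a comparison, using plain cross-validation as a cheap stand-in for the bootstrap validation described above (the column list and pipeline choices are assumptions, not a recommendation):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X is a DataFrame, y the binary outcome, and `continuous_cols` a list of
# the non-dummy column names -- all assumed to exist already.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

candidates = {
    # Standardize everything, dummies included (the textbook default).
    "scale_all": make_pipeline(StandardScaler(), lasso),
    # Standardize only the continuous predictors; pass dummies through as 0/1.
    "scale_continuous_only": make_pipeline(
        ColumnTransformer([("cont", StandardScaler(), continuous_cols)],
                          remainder="passthrough"),
        lasso,
    ),
}

for name, pipe in candidates.items():
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_log_loss")
    print(name, scores.mean())
```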


I appreciate the issue of destroying the sparse structure provided by binary/dummy coding, and that can be an issue with the efficiency of handling very large data sets that are coded as sparse matrices. For the scale of your problem, with just a few thousand cases and a couple of hundred predictors, this isn't a practical problem and it will make no difference in how the regression is handled: however you might have normalized the categorical variables, each will still have the same number of categories as before, just with different numerical values (and thus different sensitivity to penalization).

Note that normalization by rows does not solve the problems discussed here and may exacerbate them. Normalization by rows can be a useful step in situations like gene expression studies, where all measurements are essentially on the same scale but there might be systematic differences in overall expression among samples. With mixes of continuous predictors measured on different scales together with categorical predictors, however, row-normalization won't be helpful.