Machine Learning – Should One Hot Encoding or Dummy Variables Be Used With Ridge Regression?

machine-learning, regularization

For a regression problem in which the predictor is a single categorical variable with $q$ categories, Ridge regression can be considered the Best Linear Unbiased Predictor (BLUP) for the mixed model

$$ \mathbf{y} = \mathbf{X} \beta +\mathbf{Zu}+ \boldsymbol{\epsilon} $$

In this case, $\mathbf{X}$ is just a column of 1s and $\beta$ is the intercept. $\mathbf{Z}$ is the design matrix that encodes the $q$ random effects. In this situation, Ridge regression is the BLUP estimator if we set the ridge parameter to $\lambda = \sigma_{\epsilon}^2 / \sigma^2$, where $\sigma_{\epsilon}^2$ is the variance of $\boldsymbol{\epsilon}$ and $\sigma^2$ is the variance of each random effect in $\mathbf{u}$ (the random effects are assumed to be isotropic). I've been told this is equivalent to fitting a Ridge model in which the feature matrix is a one-hot encoding of the categories (that is, all $q$ categories appear as columns).
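To see this concretely, here is a small sketch (my own check, not from the reference below): for balanced groups the BLUP shrinks each group mean toward the grand mean by the factor $n_j/(n_j+\lambda)$, and a Ridge fit on the full one-hot encoding with an unpenalized intercept and alpha set to $\lambda$ reproduces exactly those predictions.

import numpy as np
from sklearn.linear_model import Ridge

# Balanced toy data: 3 groups of 5 observations (essentially the same data used below)
y = np.arange(1, 16, dtype=float)
groups = np.repeat([0, 1, 2], 5)
Z = np.eye(3)[groups]                      # full one-hot encoding of the groups

lam = 3.0                                  # plays the role of sigma_eps^2 / sigma^2
ridge = Ridge(alpha=lam, fit_intercept=True).fit(Z, y)
print(ridge.predict(np.eye(3)))            # [ 4.875  8.    11.125]

# Closed-form BLUP for the balanced case: shrink group means toward the grand mean
n_j = 5
group_means = np.array([y[groups == j].mean() for j in range(3)])
print(y.mean() + n_j / (n_j + lam) * (group_means - y.mean()))   # identical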

Some convincing arguments are made, which I will summarize here in Python code. They are themselves summaries from a paper called That BLUP is a Good Thing: The Estimation of Random Effects (Robinson, 1991). Here, 3 groups are simulated and 4 Ridge models are fit: 3 in which each group takes a turn as the reference group, and one in which all groups appear in the feature matrix. A plot of the predictions is shown below.

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
import seaborn as sb
import matplotlib.pyplot as plt

D=pd.DataFrame({'group':[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3],'y':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})

# Now let's dummy code the group in four different ways
X  = pd.get_dummies(D['group'], drop_first=False).values  # Full dummy code with a dummy variable for each group
X1 = X[:, 1:3]    # Drop the first group
X2 = X[:, [0, 2]] # Drop the second group
X3 = X[:, :2]     # Drop the last group

# Prediction matrices: one row per group, coded to match the corresponding feature matrix
Xpred  = np.eye(3)         # full one-hot rows for groups 1, 2, 3
Xpred1 = Xpred[:, 1:3]
Xpred2 = Xpred[:, [0, 2]]
Xpred3 = Xpred[:, :2]

# Now let's use the different dummy coding schemes for Ridge regression, using a Ridge coefficient of 1
# First, the coding that drops the first group:
R1 = Ridge(alpha = 1)
R1.fit(X1,D['y'])
ypred_R1=R1.predict(Xpred1)
ypred_R1
>>> array([ 4.875     ,  7.47916667, 11.64583333])

# Then dropping the middle group  
R2 = Ridge(alpha = 1)
R2.fit(X2,D['y'])
ypred_R2=R2.predict(Xpred2)
ypred_R2
>>> array([ 3.83333333,  8.        , 12.16666667])

# And finally dropping the third group
R3 = Ridge(alpha = 1)
R3.fit(X3,D['y'])
ypred_R3=R3.predict(Xpred3)
ypred_R3
>>> array([ 4.35416667,  8.52083333, 11.125     ])

# Now we have three regressors instead of two. To achieve a similar amount of shrinkage, we
# therefore need to increase our Ridge coefficient a bit.
R = Ridge(alpha = 3, fit_intercept = True)
R.fit(X,D['y'])
ypred_R=R.predict(Xpred)
ypred_R
>>> array([ 4.875,  8.   , 11.125])
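The comparison plot below can be drawn with something like the following (a sketch continuing the snippet above, so it reuses D, sb, plt, and the ypred arrays):

# Collect the per-group predictions from the four fits and plot them
preds = pd.DataFrame({
    'group': [1, 2, 3],
    'drop group 1 (alpha=1)': ypred_R1,
    'drop group 2 (alpha=1)': ypred_R2,
    'drop group 3 (alpha=1)': ypred_R3,
    'all groups (alpha=3)':   ypred_R,
}).melt(id_vars='group', var_name='model', value_name='prediction')

sb.scatterplot(data=preds, x='group', y='prediction', hue='model')
plt.axhline(D['y'].mean(), linestyle='--', color='grey')  # grand mean the predictions shrink towards
plt.xticks([1, 2, 3])
plt.show()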

[Plot of the predictions from the four Ridge fits]

It turns out that making one group the reference group imposes implicit assumptions about the covariance of $\mathbf{u}$. Here is a plot of the covariance structure implied when we drop one of the variables to use as a reference:

[Plot of the covariance structure implied by reference coding]

Compare this to the true covariance structure assumed by the model with all 3 predictors in the feature matrix:

[Plot of the isotropic covariance structure assumed by the full one-hot model]
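These implied covariance matrices can be computed directly. The sketch below assumes that an isotropic ridge penalty amounts to treating the penalized coefficients as i.i.d. with a common variance, so that the vector of group effects $C\mathbf{b}$ has covariance proportional to $CC^\top$, where $C$ maps the penalized coefficients to the three group effects:

import numpy as np

C_onehot = np.eye(3)            # all three groups get their own indicator
C_ref1 = np.array([[0, 0],      # group 1 absorbed into the intercept
                   [1, 0],
                   [0, 1]])

# Implied covariance of the group effects (up to the scale sigma^2): C @ C.T
print(C_onehot @ C_onehot.T)    # identity: every group is treated symmetrically
print(C_ref1 @ C_ref1.T)        # diag(0, 1, 1): the reference group is treated as known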

Question

I've never before seen the recommendation to keep all categories in a Ridge regression. Indeed, if we are cross-validating over the penalty, the benefit of the implied covariance structure does not seem worth the instability induced by the collinearity.

Does a ridge model in which all categories appear as binary indicators perform better (i.e. lower loss) than a ridge model which absorbs a category into the intercept term? If so, why? Some (not very rigorous) experiments seem to hint that the answer is "no", but I'm interested in hearing the perspectives of other data scientists and statisticians.
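For concreteness, the kind of (not very rigorous) experiment I have in mind looks like the sketch below; the simulation settings are arbitrary and the comparison is only illustrative.

import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Simulate a single categorical predictor with random group effects
n_groups, n_per_group = 20, 10
effects = rng.normal(0, 1, n_groups)
g = np.repeat(np.arange(n_groups), n_per_group)
y = effects[g] + rng.normal(0, 1, g.size)

X_onehot = pd.get_dummies(g, drop_first=False).values.astype(float)  # all groups as indicators
X_ref    = pd.get_dummies(g, drop_first=True).values.astype(float)   # one group absorbed into the intercept

alphas = np.logspace(-2, 3, 30)
for name, X in [('one-hot', X_onehot), ('reference coding', X_ref)]:
    scores = cross_val_score(RidgeCV(alphas=alphas), X, y, cv=5,
                             scoring='neg_mean_squared_error')
    print(name, -scores.mean())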

Additionally, if the goal is prediction then what are the general consequences of imposing the wrong covariance structure on the data by making one group a reference category?

Should we include all categories when fitting a ridge model?

Best Answer

This issue has been appreciated for some time. See Harrell on page 210 of Regression Modeling Strategies, 2nd edition:

For a categorical predictor having $c$ levels, users of ridge regression often do not recognize that the amount of shrinkage and the predicted values from the fitted model depend on how the design matrix is coded. For example, one will get different predictions depending on which cell is chosen as the reference cell when constructing dummy variables.

He then cites the approach used in 1994 by Verweij and Van Houwelingen, Penalized Likelihood in Cox Regression, Statistics in Medicine 13, 2427-2436. Their approach was to use a penalty function applied to all levels of an unordered categorical predictor. With $l(\beta)$ the partial log-likelihood at a vector of coefficient values $\beta$, they defined the penalized partial log-likelihood at a weight factor $\lambda$ as:

$$l^{\lambda}(\beta) = l(\beta) - \frac{1}{2} \lambda p(\beta)$$

where $p(\beta)$ is a penalty function. At a given value of $\lambda$, coefficient estimates $b^{\lambda}$ are chosen to maximize this penalized partial likelihood.

They define the penalty function for an unordered categorical covariate having $c$ levels as:

$$p_0(\beta) = \sum_{j=1}^c \left( \beta_j - \bar \beta \right)^2$$

where $\bar \beta$ is the average of the individual regression coefficients for the categories. "This function penalizes $\beta_j$'s that are further from the mean." It also removes the special treatment of a reference category $1$ taken to have $\beta_1=0$. In illustrating their penalized partial likelihood approach in a model with a 49-level categorical covariate, they constrained the coefficients to sum to zero, and then optimized to choose $\lambda$ and generate the penalized coefficient estimates.
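Note that $p_0(\beta)$ is the quadratic form $\beta^\top \left(I - \tfrac{1}{c} J\right) \beta$, with $J$ the $c \times c$ all-ones matrix, and that it is invariant to adding a constant to every coefficient, which is exactly why no category needs to be singled out as the reference. A small sketch illustrating both points:

import numpy as np

def p0(beta):
    # Verweij & Van Houwelingen penalty: squared deviations from the mean coefficient
    return np.sum((beta - beta.mean()) ** 2)

c = 4
beta = np.array([1.0, 2.0, 3.0, 6.0])
P = np.eye(c) - np.ones((c, c)) / c       # I - J/c

print(p0(beta), beta @ P @ beta)          # the same value

# Shifting every coefficient by a constant leaves p0 unchanged,
# unlike the ordinary ridge penalty sum(beta**2)
shifted = beta + 10.0
print(p0(shifted), np.sum(shifted ** 2))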

Penalization must involve all levels of a multi-level categorical covariate in some way, as the OP and another answer indicate. One-hot encoding is one way to do that. The approach above is an alternative that works with dummy coding, and it seems to put the emphasis on deviations of the individual coefficients from the mean coefficient within the same covariate, rather than on their differences from the coefficients of unrelated covariates.
