LASSO with Interaction Terms – Main Effects and Shrinking to Zero

glmnet, lasso, machine learning, regularization

LASSO regression shrinks coefficients towards zero, thus effectively performing model selection. I believe that in my data there are meaningful interactions between nominal and continuous covariates. However, the 'main effects' of the true model are not necessarily meaningful (non-zero). Of course, I do not know this, since the true model is unknown. My objectives are to find the true model and to predict the outcome as closely as possible.

I have learned that the classical approach to model building always includes a main effect before an interaction is included. Thus a model cannot contain the interaction $X*Z$ of two covariates $X$ and $Z$ without also containing the main effects of $X$ and $Z$. The step function in R consequently selects model terms carefully (e.g., based on backward or forward AIC), abiding by this rule, as in the sketch below.
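For example, a minimal sketch on simulated data (variable names are illustrative) of how step() respects this marginality rule:

```r
## Minimal sketch on simulated data: backward AIC selection via step()
## respects marginality, so it never offers to drop x or z while the
## interaction x:z is still in the model.
set.seed(1)
dat <- data.frame(x = rnorm(100),
                  z = factor(sample(c("a", "b"), 100, replace = TRUE)))
dat$y <- 2 * dat$x * (dat$z == "b") + rnorm(100)

fit_full <- lm(y ~ x * z, data = dat)
step(fit_full, direction = "backward")  # only x:z is eligible for removal
```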

LASSO seems to work differently. Since all parameters are penalized, it can easily happen that a main effect is shrunk to zero while an interaction in the best (e.g., cross-validated) model is non-zero. I find this in particular for my data when using R's glmnet package.
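As a hedged illustration, here is a sketch of this behaviour with glmnet on simulated data in which the true main effects are exactly zero (names are illustrative):

```r
## Sketch: LASSO via glmnet can shrink main effects to exactly zero
## while keeping the interaction, when the true main effects are zero.
library(glmnet)

set.seed(1)
n <- 200
x <- rnorm(n)
z <- factor(sample(c("a", "b"), n, replace = TRUE))
y <- 2 * x * (z == "b") + rnorm(n)          # interaction only, no main effects

X <- model.matrix(~ x * z, data.frame(x, z))[, -1]  # columns: x, zb, x:zb
cv_fit <- cv.glmnet(X, y, alpha = 1)                # alpha = 1 is the LASSO
coef(cv_fit, s = "lambda.min")  # main-effect coefficients often exactly 0
```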

I received criticism based on the classical rule quoted above, i.e., my final cross-validated LASSO model does not include the corresponding main-effect terms of some non-zero interaction. However, this rule seems somewhat strange in this context. What it comes down to is the question of whether the parameter in the true model is zero. Suppose the main effect is zero in the true model but the interaction is not; then LASSO may identify this and thus find the correct model. In fact, it seems that predictions from this model will be more precise, because the model does not contain the truly zero main effect, which is effectively a noise variable.

May I refute the criticism on this ground, or should I somehow take precautions so that LASSO includes the main effect before the interaction term?

Best Answer

One difficulty in answering this question is that it's hard to reconcile LASSO with the idea of a "true" model in most real-world applications, which typically have non-negligible correlations among predictor variables. In that case, as with any variable selection technique, the particular predictors returned with non-zero coefficients by LASSO will depend on the vagaries of sampling from the underlying population. You can check this by performing LASSO on multiple bootstrap samples from the same data set and comparing the sets of predictor variables that are returned.
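A sketch of that bootstrap check (self-contained, on simulated data with illustrative names; in practice you would resample your own data set):

```r
## Sketch: run LASSO on bootstrap resamples and record which predictors
## receive non-zero coefficients; unstable selections show up as
## selection frequencies well below 1.
library(glmnet)

set.seed(1)
n <- 200
x <- rnorm(n)
z <- factor(sample(c("a", "b"), n, replace = TRUE))
y <- 2 * x * (z == "b") + rnorm(n)
X <- model.matrix(~ x * z, data.frame(x, z))[, -1]

selected <- replicate(100, {
  idx <- sample(n, replace = TRUE)
  cv  <- cv.glmnet(X[idx, ], y[idx], alpha = 1)
  as.vector(coef(cv, s = "lambda.min"))[-1] != 0    # drop the intercept
})
setNames(rowMeans(selected), colnames(X))  # selection frequency per term
```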

Furthermore, as @AndrewM noted in a comment, the bias of estimates provided by LASSO means that you will not be predicting outcomes "as closely as possible." Rather, you are predicting outcomes that are based on a particular choice of the unavoidable bias-variance tradeoff.

So given those difficulties, I would hope that you would want to know for yourself, not just to satisfy a critic, the magnitudes of main effects of the variables that contribute to the interaction. There is a package available in R, glinternet, that seems to do precisely what you need (although I have no experience with it):

Group-Lasso INTERaction-NET. Fits linear pairwise-interaction models that satisfy strong hierarchy: if an interaction coefficient is estimated to be nonzero, then its two associated main effects also have nonzero estimated coefficients. Accommodates categorical variables (factors) with arbitrary numbers of levels, continuous variables, and combinations thereof.
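A hedged sketch of how glinternet might be called; I have not run this myself, and the coding below follows the package's documented convention that categorical columns are 0-based integers and numLevels is 1 for continuous columns:

```r
## Sketch (untested): glinternet expects categorical columns coded as
## 0-based integers and a numLevels vector (1 = continuous column).
library(glinternet)

set.seed(1)
n <- 200
x <- rnorm(n)
z <- sample(0:1, n, replace = TRUE)     # two-level factor coded 0/1
y <- 2 * x * (z == 1) + rnorm(n)

Xg <- cbind(x, z)
fit <- glinternet.cv(Xg, y, numLevels = c(1, 2))  # x continuous, z: 2 levels
coef(fit)  # main effects appear whenever their interaction is selected
```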

Alternatively, if you do not have too many predictors, you might consider ridge regression instead, which returns coefficients for all variables; these may be much less dependent on the vagaries of your particular data sample.
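A minimal sketch of the ridge alternative, again with glmnet (alpha = 0 selects the ridge penalty), reusing the X and y from the bootstrap sketch above:

```r
## Sketch: ridge regression keeps all terms, main effects included,
## with coefficients shrunk but not set exactly to zero.
library(glmnet)

cv_ridge <- cv.glmnet(X, y, alpha = 0)   # X, y as in the sketch above
coef(cv_ridge, s = "lambda.min")         # every term retains a coefficient
```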
