R Regression – Factor Levels in Interaction Terms for Generalized Linear Models

categorical data, generalized linear model, interaction, r, regression

I have not been able to find an answer to this in other discussions or in my readings.

Say I am modeling carVal (i.e., a car's value) based on mpg (a numeric variable) and type (a factor variable with levels 0 = sedan, 1 = van, 2 = truck, 3 = suv) using glm(). I have read that if I am using some algorithm to select the "best" model features, it is not appropriate to drop the dummy variables for some levels of a factor but keep the others (i.e., carVal ~ mpg + type1 is not valid; it would have to be carVal ~ mpg + type1 + type2 + type3).

My question is: if I include an interaction term between mpg and type, is it appropriate to include the interaction for only certain levels of type, rather than for all of them?

For example, is this a valid model:

carVal ~ mpg + type1 + type2 + type3 + type1:mpg

Or, would the formula have to be the following:

carVal ~ mpg + type1 + type2 + type3 + type1:mpg + type2:mpg + type3:mpg

Here is an example of the code I am using in version 4.0.2 of R:

library(leaps)

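# Toy data: car value, fuel economy, and body type (0 = sedan, 1 = van, 2 = truck, 3 = suv)
carVal = c(1000, 15000, 1500, 2000, 2500, 5000, 8000, 9500, 11000)
mpg = c(29, 45, 20, 28, 30, 40, 35, 38, 47)
type = as.factor(c(1, 2, 2, 3, 1, 0, 1, 0, 0))
car.data = data.frame(carVal, mpg, type)
# Exhaustive best-subset search over the dummy-coded main effects and interactions
subset.model = regsubsets(x = as.formula('carVal ~ mpg + type + type:mpg'), data = car.data, method = 'exhaustive')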

summary(subset.model)

Best Answer

First, I would avoid any stepwise procedures.

That said:

if I include an interaction term between mpg and type, is it appropriate to include the interaction for only certain levels of type, rather than for all of them?

Normally you would just specify the model as

carVal ~ mpg * type

or equivalently:

carVal ~ mpg + type + mpg:type

The software will then create all the necessary dummy variables and their interactions with mpg.
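To see what R builds from such a formula, you can inspect the design matrix directly. The following is a minimal sketch using the data from the question; the object names X.star and X.colon are only illustrative, and it assumes R's default treatment contrasts, with level 0 as the reference.

```r
# Data from the question, with type stored as a factor
car.data <- data.frame(
  carVal = c(1000, 15000, 1500, 2000, 2500, 5000, 8000, 9500, 11000),
  mpg    = c(29, 45, 20, 28, 30, 40, 35, 38, 47),
  type   = factor(c(1, 2, 2, 3, 1, 0, 1, 0, 0))
)

# Design matrices for the two ways of writing the formula
X.star  <- model.matrix(carVal ~ mpg * type, data = car.data)
X.colon <- model.matrix(carVal ~ mpg + type + mpg:type, data = car.data)

# One dummy column per non-reference level, plus one mpg interaction for each
print(colnames(X.star))

# Check that both formulas expand to the same design matrix
print(all.equal(X.star, X.colon))
```

With the default contrasts, the columns should be the intercept, mpg, type1 to type3, and mpg:type1 to mpg:type3, so none of the dummy variables has to be created by hand.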

It looks from your question as though you might be creating the dummy variables yourself. In that case there is no technical reason why you can't omit some of the interactions, if you have good reason to. Bear in mind that omitting, say, type2:mpg forces the mpg slope for type 2 to equal the slope for the reference level. In my experience, this can create all kinds of problems, such as a rank-deficient model matrix or an overfitted model that generalises extremely poorly to new data.
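If you do build an interaction by hand, one way to check for rank deficiency is to compare the number of columns in the model matrix with its numerical rank. This is only a sketch: the columns type1 and type1.mpg and the helper rank.check() are hypothetical names introduced for illustration, and the fits use glm() with its default Gaussian family.

```r
# Data from the question, with type stored as a factor
car.data <- data.frame(
  carVal = c(1000, 15000, 1500, 2000, 2500, 5000, 8000, 9500, 11000),
  mpg    = c(29, 45, 20, 28, 30, 40, 35, 38, 47),
  type   = factor(c(1, 2, 2, 3, 1, 0, 1, 0, 0))
)

# Hand-built dummy for type 1 and its interaction with mpg (hypothetical names)
car.data$type1     <- as.numeric(car.data$type == "1")
car.data$type1.mpg <- car.data$type1 * car.data$mpg

# Partial model: only type 1 gets its own mpg slope
partial.fit <- glm(carVal ~ mpg + type + type1.mpg, data = car.data)
# Full model: every non-reference level gets its own mpg slope
full.fit <- glm(carVal ~ mpg * type, data = car.data)

# Compare the column count of the model matrix with its numerical rank;
# a rank below the column count signals a rank-deficient design
rank.check <- function(fit) {
  X <- model.matrix(fit)
  c(columns = ncol(X), rank = qr(X)$rank)
}

print(rank.check(partial.fit))
print(rank.check(full.fit))
```

Note that on this nine-row example type 3 appears only once, so the mpg:type3 column is an exact multiple of the type3 column and no separate slope can be estimated for that level. The check is worth running on whichever specification you end up with.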

So if you want an interaction, just use mpg * type - it will make your life much easier.