Solved – How many dummy variables should we include in our multiple linear regression analysis?

categorical-encoding, multiple-regression

I am building a multiple linear regression model and wonder how many dummy variables can be included.
I have two categorical variables:
one with 13 levels and the other with 20 levels.
Can I include all of them, or is that too many for multiple linear regression?

Best Answer

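You have two categorical variables, one with 20 levels and the other with 13. That in itself is not too much for multiple regression. Estimating them uses $(20-1)+(13-1)=31$ df (degrees of freedom), because each variable needs one dummy per level except for a reference level. Whether that works depends on the total number of observations (and on the number of continuous, measured variables in the model). One rule of thumb is to have at least 15 subjects per parameter in the model, which for these 31 dummy coefficients alone means roughly $15 \times 31 = 465$ observations. How many observations you have for each level could also be a consideration, since a level with very few observations gives an imprecise coefficient estimate.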
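The counting above can be checked with a small sketch. The data frame, its column names (`a`, `b`), and the sample size are made up for illustration, and the 15-per-parameter threshold is only the rule of thumb quoted above, applied here to the dummy coefficients alone (intercept and any continuous predictors are not counted).

```python
import numpy as np
import pandas as pd

# Hypothetical data: 600 rows, one factor with 20 levels, one with 13.
n = 600
idx = np.arange(n)
df = pd.DataFrame({
    "a": pd.Categorical(idx % 20),  # 20 levels
    "b": pd.Categorical(idx % 13),  # 13 levels
})

# drop_first=True omits one reference level per factor,
# giving (20 - 1) + (13 - 1) dummy columns.
dummies = pd.get_dummies(df, drop_first=True)
n_dummy_params = dummies.shape[1]

# Rule-of-thumb minimum sample size for the dummy coefficients alone.
subjects_per_param = 15
min_n = subjects_per_param * n_dummy_params

# Smallest number of observations in any single level of each factor.
min_count_a = df["a"].value_counts().min()
min_count_b = df["b"].value_counts().min()

print(n_dummy_params, min_n, n >= min_n)
print(min_count_a, min_count_b)
```

With real data, the per-level counts are worth inspecting before fitting, since a sparse level can be poorly estimated even when the total sample size looks adequate.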

So, if you do not have enough observations, you could consider regularization. For categorical variables, the fused lasso is one option. See "Principled way of collapsing categorical variables with many levels?".
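To give a rough feel for what regularization does when observations are scarce, here is a minimal sketch. Note that it uses ridge regression (an L2 penalty solved in closed form), not the fused lasso mentioned above, simply because ridge fits in a few lines of NumPy. The data are simulated, and the helper `dummy_code` and the penalty value `lam` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated small sample: only 40 observations for 31 dummy coefficients.
n = 40
a = rng.integers(0, 20, size=n)  # factor with 20 levels, coded 0..19
b = rng.integers(0, 13, size=n)  # factor with 13 levels, coded 0..12

def dummy_code(codes, n_levels):
    """One-hot encode integer codes, dropping level 0 as the reference."""
    return np.eye(n_levels)[codes][:, 1:]

X = np.hstack([dummy_code(a, 20), dummy_code(b, 13)])  # shape (40, 31)
y = rng.normal(size=n)  # simulated response

# Center X and y so the intercept is handled implicitly and not penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Ridge estimate: (Xc'Xc + lam * I)^{-1} Xc'yc, with an assumed penalty lam.
lam = 1.0
p = X.shape[1]
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Unpenalized least squares (minimum-norm solution) for comparison.
beta_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))
```

The ridge coefficients are shrunk toward zero relative to the unpenalized fit. A fused-lasso-type penalty would go further and pull the coefficients of similar levels toward each other, which effectively collapses levels. In practice `lam` would be chosen by cross-validation rather than fixed.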
