Solved – How to deal with nominal variable with too many levels

categorical data, generalized linear model, logistic, many-categories, modeling

I'm currently trying to model a response variable y with logistic regression, and ZIP code is my independent variable. For a nominal variable, the textbook method is to create k-1 dummy variables (assuming the nominal variable has k different levels), but for ZIP code k is too big and I can't create that many dummy variables. Is there any other way to deal with this?

Or, more generally, how should one deal with nominal variables with too many levels (k >= 100)?

Best Answer

Instead of ZIP code, use something else. Some options:

First 3 digits of ZIP code - this might work if you had data from a medium-sized region; it would not work if you had data from the whole USA, since the number of distinct prefixes would still be large.

County - not great, but often used. The problem is that counties vary greatly in population.

Congressional district - these are weird geographically, but have roughly equal populations.

State - has some problems with population size (although at least all are large).

Region or division, as defined by the Census. Other people have come up with other variations of regions.

You might also be able to combine county, state, region or division with a variable for urban/suburban/rural.
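As a rough illustration of the first option, here is a minimal Python sketch that collapses ZIP codes to their first 3 digits and then builds the k-1 dummy variables mentioned in the question. The function names zip3 and dummy_code and the sample ZIP codes are made up for this example; the same pattern would apply to county, state or region once you have a lookup table for them.

```python
def zip3(zip_code):
    """Return the first three digits of a 5-digit ZIP code as a string."""
    # Zero-pad first, so ZIP codes stored as integers (e.g. 2139)
    # do not lose their leading zero.
    return str(zip_code).strip().zfill(5)[:3]


def dummy_code(values):
    """k-1 dummy coding; the first level in sorted order is the reference."""
    levels = sorted(set(values))
    kept = levels[1:]  # drop the reference level
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Illustrative sample only; the last entry mimics a ZIP stored as an integer.
zips = ["02139", "02138", "02144", "10001", "10002", "94105", 2139]

prefixes = [zip3(z) for z in zips]
columns, rows = dummy_code(prefixes)

print("distinct ZIP codes:   ", len({str(z).zfill(5) for z in zips}))
print("distinct 3-digit ZIPs:", len(set(prefixes)))
print("dummy columns:        ", columns)
```

The resulting rows can be passed as the design matrix to whatever logistic regression routine you are using. Whether 3-digit prefixes are coarse enough depends on how much of the country your data cover, as noted above.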
