Geographic Data in Machine Learning – How to Represent Zip Codes

Tags: feature-engineering, machine-learning, many-categories

I am building a model, and I think that geographic location is likely to be very good at predicting my target variable. I have the zip code of each of my users, but I am not entirely sure of the best way to include zip code as a predictor feature in my model. Although a zip code is a number, the value itself carries no meaning as it goes up or down. I could binarize all 30,000 zip codes and then include them as features or new columns (e.g., {user_1: {61822: 1, 62118: 0, 62444: 0, ...}}). However, this seems like it would add a ton of features to my model.
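To make the concern concrete, here is a minimal sketch of the binarization (one-hot encoding) described above; the zip codes and user are made-up examples, and with a real vocabulary of ~30,000 zips you would get one column per zip:

```python
# Sketch: one-hot ("binarized") zip code features.
# Each zip in the vocabulary becomes its own binary column.
zip_vocab = ["61822", "62118", "62444"]  # in practice: ~30,000 entries

def one_hot_zip(user_zip, vocab):
    """Return a dict of binary indicator features, one per zip in the vocab."""
    return {z: int(z == user_zip) for z in vocab}

features = one_hot_zip("61822", zip_vocab)
print(features)  # {'61822': 1, '62118': 0, '62444': 0}
```

In a real pipeline you would typically use a sparse encoder (e.g. scikit-learn's `OneHotEncoder`) rather than dense dicts, since almost every entry is zero.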

Any thoughts on the best way to handle this situation?

Best Answer

One of my favorite uses of zip code data is to look up demographic variables based on zipcode that may not be available at the individual level otherwise...

For instance, with http://www.city-data.com/ you can look up income distribution, age ranges, etc., which might tell you something about your data. These continuous variables are often far more useful than binarized zip codes alone, at least when you have a relatively limited amount of data.
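A sketch of this enrichment step, assuming a zip-to-demographics lookup table; the demographic numbers below are invented placeholders, and in practice they would come from a source like city-data.com or the Census:

```python
# Sketch: replace a raw zip code with zip-level continuous features
# joined from a lookup table (values here are invented placeholders).
zip_demographics = {
    "61822": {"median_income": 72000, "median_age": 33.5},
    "62118": {"median_income": 48000, "median_age": 41.2},
}
DEFAULT = {"median_income": None, "median_age": None}  # fallback for unseen zips

def enrich(user):
    """Merge zip-level demographic features into a user record."""
    demo = zip_demographics.get(user["zip"], DEFAULT)
    return {**user, **demo}

user = {"id": 1, "zip": "61822"}
print(enrich(user))  # {'id': 1, 'zip': '61822', 'median_income': 72000, 'median_age': 33.5}
```

With tabular data this is just a left join (e.g. `pandas.DataFrame.merge`) of your users against the zip-level table, after which the zip code column itself can often be dropped.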

Also, zip codes are hierarchical... if you take the first two or three digits and binarize based on those, you get some amount of regional information, and each region has more data behind it than an individual zip does.
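A quick sketch of that truncation, assuming zips are stored as strings (they must be, to keep leading zeros):

```python
# Sketch: coarsen zip codes to their 3-digit prefix, grouping nearby zips
# into a few hundred regions instead of ~30,000 individual codes.
def zip_prefix(zip_code, digits=3):
    """Return the leading digits of a zip code as a regional key."""
    return zip_code[:digits]

zips = ["61822", "61874", "62118", "10001"]
regions = sorted({zip_prefix(z) for z in zips})
print(regions)  # ['100', '618', '621'] -- 61822 and 61874 now share a region
```

The resulting prefix column can then be binarized exactly like the raw zips, but with far fewer columns.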

As Zach said, using latitude and longitude can also be useful, especially in a tree-based model. For a regularized linear model, you can use quadtrees: split the United States into four geographic groups, binarize those, then split each of those areas into four more, and include those as additional binary variables... so for n total leaf regions you end up with [(4n - 1)/3 - 1] total variables (n for the smallest regions, n/4 for the next level up, etc.). Of course these variables are multicollinear, which is why regularization is needed to make this work.
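The nested regions above can be sketched as quadtree cell IDs: at each level the bounding box is split into four quadrants, and the point's cell ID at every level becomes one binary feature. This is a minimal illustration, not a production geohash; the bounding box and the Champaign, IL coordinates are assumptions for the example:

```python
# Sketch: quadtree cell IDs for a (lat, lon) point. Emitting the cell ID
# at every depth yields the nested, multicollinear binary features
# described above (one indicator per cell a point falls in).
def quadtree_cells(lat, lon, depth, bounds=(-90.0, 90.0, -180.0, 180.0)):
    """Return one cell ID string per level, each extending the last."""
    lat_lo, lat_hi, lon_lo, lon_hi = bounds
    path, cells = "", []
    for _ in range(depth):
        lat_mid = (lat_lo + lat_hi) / 2
        lon_mid = (lon_lo + lon_hi) / 2
        # Quadrant index: 2 if in the upper (north) half, +1 if in the east half.
        quadrant = 2 * (lat >= lat_mid) + (lon >= lon_mid)
        path += str(quadrant)
        cells.append(path)
        # Shrink the bounding box to the chosen quadrant.
        if lat >= lat_mid: lat_lo = lat_mid
        else:              lat_hi = lat_mid
        if lon >= lon_mid: lon_lo = lon_mid
        else:              lon_hi = lon_mid
    return cells

# Roughly Champaign, IL: 40.1 N, 88.2 W
print(quadtree_cells(40.1, -88.2, 3))  # ['2', '21', '212']
```

Each returned cell ID would be one-hot encoded; a point at depth d contributes d overlapping indicators, matching the n + n/4 + ... counting in the text.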
