Solved – Encoding IP Address as a Predictor in Machine Learning

feature-engineering, machine learning, many-categories

Is there some approach to "encoding" IP addresses (IPv4) so that the new representation captures both the cardinality and the statistical distribution of the full range of IP addresses, as well as aspects like belonging to the same network? I think that converting to an integer (two-way) or hashing doesn't capture the aforementioned aspects. I know that using a vector representation could be an alternative, but I would like to know if there is another.

Best Answer

I think sites like this give you the ISP associated with an IP address, and you can back out latitude-longitude coordinates, country, postcode, or time zone from that. Depending on the specifics of your problem, any of those could be a pretty good spatial predictor/feature. I have a hunch that people on this site are better at answering questions after you've chosen one of these representations than at the networking specifics, such as how IP addresses are mapped to those locations.

Your data set is probably longer than two or three observations, so it might be useful to capture these lookup results programmatically. Questions like that have been asked many times on Stack Overflow.
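As a rough illustration of doing the lookup in code, here is a minimal sketch. It assumes you already have a local table mapping network prefixes to locations; the `GEO_TABLE` entries and the `lookup_location` helper are made up for this example (the ranges are ones reserved for documentation), and a real table would come from whichever geolocation database or service you pick.

```python
import ipaddress

# Hypothetical prefix-to-location table. The entries are invented for
# illustration and use address ranges reserved for documentation.
GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): {"country": "AA", "lat": 10.0, "lon": 20.0},
    ipaddress.ip_network("198.51.100.0/24"): {"country": "BB", "lat": -5.0, "lon": 30.0},
}


def lookup_location(ip_string):
    """Return the record of the first network containing the address, or None."""
    address = ipaddress.ip_address(ip_string)
    for network, record in GEO_TABLE.items():
        if address in network:
            return record
    return None


# Turn a column of addresses into spatial features (None when unknown).
addresses = ["192.0.2.17", "198.51.100.200", "203.0.113.5"]
locations = [lookup_location(a) for a in addresses]
```

A linear scan like this is fine for a small table; with a large one you would probably want an indexed structure instead.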

If you're interested in the distribution of digits of the IP address, post some data, and then I would have some opinions. Right now, I have no intuition about that.
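If you do want to look at the digits, or at whether addresses share a network, one possible starting point is to split each address into its four octets and a few network prefixes using Python's standard `ipaddress` module. This is only a sketch: the `ip_features` helper and the choice of /8, /16, and /24 prefix lengths are my assumptions, not something taken from your data.

```python
import ipaddress


def ip_features(ip_string, prefix_lengths=(8, 16, 24)):
    """Split an IPv4 address into octet and network-prefix features."""
    address = ipaddress.IPv4Address(ip_string)
    features = {}
    # The packed form is four bytes, one per octet.
    for i, octet in enumerate(address.packed):
        features[f"octet_{i + 1}"] = octet
    # Addresses that share a prefix value fall in the same network
    # at that prefix length (the lengths here are an arbitrary choice).
    for length in prefix_lengths:
        network = ipaddress.ip_network(f"{ip_string}/{length}", strict=False)
        features[f"prefix_{length}"] = str(network.network_address)
    features["is_private"] = address.is_private
    return features


example = ip_features("192.168.1.77")
```

The prefix columns are categorical, so they would still need something like grouping of rare levels or target encoding before going into most models.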