Is there some approach to "encoding" IP Address (IPv4) in a way that the new representation can capture both cardinality and the statistical distribution of the full range of IP address and also aspects like belonging to the same network. I think that converting to an integer (two-way) or hashing dont capture the aspects aforementioned. I know that using vector-representation could be an alternative but I would like to know if there is another.
Solved – Encoding IP Address as a Predictor in Machine Learning
feature-engineeringmachine learningmany-categories
Related Solutions
For categorical variables code a new category of "missing"; for continuous variables set missing values to any constant value $a$ & add an indicator variable for missingness. E.g. let the linear predictor be given by $$\eta=\beta_0 + \beta_1 x_1 + \beta_2 q_1 + \ldots$$ When $x_1$ is not missing set $q_1$ to $0$: $$\eta=\beta_0 + \beta_1 x_1 + \ldots$$ When $x_1$ is missing set $q_1$ to $1$ & $x_1$ to $a$: $$\eta=\beta_0 + \beta_1 a + \beta_2 + \ldots$$ Note that whatever value you choose for $a$ affects only the interpretation of $\beta_2$, leaving the model's predictions unchanged.† The new predictor $(x_1, q_1)$ can be used in interactions, & $q_1$ alone; but not $x_1$ alone as it contains the arbitrary $a$.
Unfortunately, even when data are missing completely at random this technique leads to biased estimates when the predictors are correlated (i.e. almost always, with observational data). See Jones (1996), "Indicator & stratification methods for missing explanatory variables in multiple linear regression", JASA, 91, 433. Imputation of missing values is preferable. Little & Rubin (2002), Statistical Analysis with Missing Data, is a good introduction to the problems arising with missing data & techniques for dealing with them.
† Of course you need to be careful when using any techniques that penalizes coefficients according to their magnitude.
Binary variables
No encoding is needed: use them as is.
Nominal data
When you have an variable that can take on a finite number of values, that's called a categorical variable. When the values can't be ordered (e.g., red, blue, green), that's called a nominal variable. A nominal variable is one kind of categorical variable.
For nominal variables, the usual way to encode them is with a one-hot encoding. If there are $N$ possible values for the variable, you map each value to a $N$-vector that has a $1$ in the position corresponding to that value and $0$ elsewhere.
For instance: red $\mapsto (1,0,0)$, blue $\mapsto (0,1,0)$, green $\mapsto (0,0,1)$.
Ordinal data
When you have a categorical variable where the values can be ordered (sorted), but the ordering doesn't imply anything about how much they differ, that's called a ordinal variable (see ordinal data).
For example, suppose you have a ranking: John finished in 3rd place, Jane in 6th place. You know that John finished before Jane, but that doesn't necessarily mean that John was $6/3=2$ times as fast as Jane.
You can encode ordinal data using the thermometer trick. If there are $N$ possible values for the variable, then you map each value to a $N$-vector, where you put a $1$ in the position that matches the value of the variable and all subsequent position.
For instance: first place $\mapsto (1,1,1)$, second place $\mapsto (0,1,1)$, third place $\mapsto (0,0,1)$.
You can also apply binning if $N$ is too large, but usually it's better not to do that.
Numerical variables
Finally, you may encounter variables that directly measure a number, and where they can be not only ordered, but also subtracted or divided. Then, it's typically best to use the number directly, or possibly use the logarithm of the number. (You might take the logarithm if the number represents a ratio, or if there is a very wide range of values.)
Useful background
To understand these terms, it's helpful to learn about "level of measurements": https://en.wikipedia.org/wiki/Level_of_measurement.
Scaling
Finally, when you're using neural networks or "deep learning", you'll normally want to standardize/rescale all numerical attributes before applying deep learning. I suggest you treat that as a separate process from the feature mappings mentioned above, to be performed after you apply the feature mapping.
Best Answer
I think sites like this give you the ISP associated with an IP, and you can back out latitude-longitude coordinates/country/post-code/timezone from that. Depending on the specifics of your problem, any of those could be a pretty good spatial predictor/feature. I have a hunch that people on this site that people on this site are better at answering questions after you've chosen one of these representations, and not on any of the specifics about networking things such as how IP addresses can be used to get those locations.
Your data set is probably longer than two or three observations, so it might be useful to try to capture these results programmatically. Questions like that have been asked many times on stackoverflow.
If you're interested in the distribution of digits of the IP address, post some data, and then I would have some opinions. Right now, I have no intuition about that.