Predicting with both continuous and categorical features

categorical data, classification, continuous data, discrete data, predictive-models

Some predictive modeling techniques are designed primarily for continuous predictors, while others are better suited to categorical or discrete variables. Of course, there are techniques for transforming one type into the other (discretization, dummy variables, etc.). However, are there any predictive modeling techniques that are designed to handle both types of input at the same time without simply transforming the type of the features? If so, do these techniques tend to work better on data for which they are a more natural fit?

The closest thing I know of is decision trees, which usually handle discrete data well and handle continuous data without requiring up-front discretization. However, this isn't quite what I am looking for, since the splits on continuous features are effectively just a form of dynamic discretization.
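To make the "dynamic discretization" point concrete, here is a minimal sketch of how a tree might choose a cut point on one continuous feature. It assumes a binary split of the form `x <= t` scored by weighted Gini impurity; the function names and the toy `ages`/`bought` data are made up for illustration and are not taken from any particular library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split x <= threshold.

    Candidate thresholds are midpoints between consecutive distinct sorted values.
    Returns (None, inf) when no split is possible.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        threshold = (lo + hi) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: the feature is continuous, the target is categorical.
ages = [22, 25, 31, 38, 45, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_threshold(ages, bought)
print(threshold, impurity)
```

The continuous feature is never binned ahead of time, but each node still reduces it to a two-valued indicator (`age <= threshold` or not), which is the sense in which the split acts as a discretization chosen during training.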

Best Answer

As far as I know, and I've researched this issue deeply in the past, there are no predictive modeling techniques (besides trees, XGBoost, etc.) that are designed to handle both types of input at the same time without simply transforming the type of the features.

Note that algorithms like Random Forest and XGBoost accept mixed features as input, but they apply their own logic to handle each type when splitting a node. Make sure you understand that logic "under the hood" and that you're OK with whatever is happening inside the black box.

That said, distance- and kernel-based models (e.g., k-NN, NN regression, support vector machines) can handle a mixed-type feature space by defining a “special” distance function that applies an appropriate metric to each feature: for a numeric feature we calculate the Euclidean distance between the two numbers, while for a categorical feature we simply calculate the overlap distance between the two string values. The distance between users $u_1$ and $u_2$ in feature $f_i$ is then:

$$d(u_1,u_2)_{f_i} = \begin{cases} \text{dis-categorical}(u_1,u_2)_{f_i} & \text{if feature } f_i \text{ is categorical} \\ \text{dis-numeric}(u_1,u_2)_{f_i} & \text{if feature } f_i \text{ is numerical} \\ 1 & \text{if feature } f_i \text{ is not defined in } u_1 \text{ or } u_2 \end{cases}$$
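As a rough illustration of the per-feature distance just described, the sketch below plugs it into a tiny k-NN classifier. Several choices here are assumptions rather than part of the answer: numeric differences are scaled by a user-supplied feature range so that they are comparable to the 0/1 overlap distance, the per-feature distances are combined as the square root of the sum of squares, and missing values are represented by `None`. All function names, the `schema` layout, and the toy data are hypothetical.

```python
import math
from collections import Counter

def feature_distance(a, b, kind, value_range=None):
    """Distance between two values of one feature; 1.0 if either is missing."""
    if a is None or b is None:
        return 1.0
    if kind == "categorical":
        return 0.0 if a == b else 1.0  # overlap distance
    # Numeric: absolute difference, scaled by the feature's range (assumption).
    return abs(a - b) / value_range if value_range else 0.0

def mixed_distance(u1, u2, schema):
    """Combine per-feature distances; schema maps feature name -> (kind, range)."""
    total = 0.0
    for name, (kind, value_range) in schema.items():
        d = feature_distance(u1.get(name), u2.get(name), kind, value_range)
        total += d * d
    return math.sqrt(total)

def knn_predict(query, rows, labels, schema, k=3):
    """Majority vote among the k rows nearest to query under mixed_distance."""
    ranked = sorted(range(len(rows)),
                    key=lambda i: mixed_distance(query, rows[i], schema))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data with one numeric and one categorical feature.
schema = {"age": ("numeric", 40.0), "city": ("categorical", None)}
rows = [
    {"age": 25, "city": "Paris"},
    {"age": 30, "city": "Paris"},
    {"age": 60, "city": "Rome"},
    {"age": 65, "city": None},  # city is not defined for this user
]
labels = ["a", "a", "b", "b"]
query = {"age": 28, "city": "Paris"}
print(knn_predict(query, rows, labels, schema, k=3))
```

How the per-feature terms are weighted and combined is a modeling decision in its own right; without some scaling of the numeric features, a single wide-ranging numeric column can dominate the categorical ones.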

Some known distance functions for categorical features:

  • Levenshtein distance (or any form of "edit distance")

  • Longest common subsequence metric

  • Gower distance
  • More measures are compared in:
    • Boriah, S., Chandola, V., and Kumar, V. (2008). Similarity measures for categorical data: A comparative evaluation. In: Proceedings of the 8th SIAM International Conference on Data Mining, SIAM, p. 243-254.
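As an illustration of the first measure in the list above, here is a minimal sketch of Levenshtein distance between two string-valued categories, using the standard dynamic-programming recurrence with unit costs. The `normalized_levenshtein` helper, which divides by the longer string's length to land in [0, 1], is an added convenience and an assumption, not something the answer prescribes.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def normalized_levenshtein(s, t):
    """Edit distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0

print(levenshtein("kitten", "sitting"))  # 3
```

A string-based distance like this only makes sense when similar spellings imply similar categories (e.g., typos or name variants); for arbitrary labels, the simple overlap distance is usually the safer default.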