Solved – Predict multiple outcome vectors at once (no multinomial or multiclass)

classification, multivariate analysis, r

Let's say I have a dataset where I need to predict two or more variables (classification). Please understand that I really don't mean multinomial or multiclass classification; it's about predicting multiple outcome vectors at once:

$x_1, x_2, x_3, x_4, \dots, x_n \rightarrow y_1, y_2$

Does anyone know what sort of approach is good for this (preferably in R)? To be honest, I've searched for answers a lot but couldn't find anything.

Best Answer

The decision you need to make is whether you want to use two separate models or one combined model. In general, a single model is better, as it takes care of correlations between $y_{1}$ and $y_{2}$, but it's harder and requires you to think about the problem more.
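For concreteness, here is a minimal sketch of the two-separate-models approach in R, assuming a hypothetical data frame `dat` with predictors `x1`–`x4` and binary factor targets `y1` and `y2`:

```r
# Two independent logistic regressions, one per target.
fit_y1 <- glm(y1 ~ x1 + x2 + x3 + x4, data = dat, family = binomial)
fit_y2 <- glm(y2 ~ x1 + x2 + x3 + x4, data = dat, family = binomial)

# Each target is predicted separately; any correlation
# between y1 and y2 is simply ignored.
p1 <- predict(fit_y1, newdata = dat, type = "response")
p2 <- predict(fit_y2, newdata = dat, type = "response")
```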

For example, as far as I know, logistic regression doesn't lend itself very naturally to predicting two separate target variables. If both of your target variables are binary, you could make a single target variable with four outcomes, one for each of the four possible combinations. This is probably a good approach, and it deals very naturally with correlations between your two targets, but obviously as the cardinality of either target increases, it doesn't scale very well.
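As a rough illustration of this compound-target idea (using the same hypothetical `dat` as above), you can build the four-level target with `interaction()` and fit a multinomial model, e.g. with `nnet::multinom`:

```r
library(nnet)  # provides multinom()

# Collapse the two binary targets into one four-level factor.
dat$y12 <- interaction(dat$y1, dat$y2, drop = TRUE)

# A single multinomial model over the four combinations, so correlations
# between y1 and y2 are captured in the joint class probabilities.
fit_joint <- multinom(y12 ~ x1 + x2 + x3 + x4, data = dat)

# Joint probabilities over the four combinations; marginal probabilities
# for y1 or y2 alone are obtained by summing the appropriate columns.
probs <- predict(fit_joint, newdata = dat, type = "probs")
```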

Models that lend themselves quite naturally to multiple targets tend to be models that "generate features". Two very different classes of model that can be interpreted as doing this are tree-based models (e.g. decision trees/random forests) and deep learning models.

If you train a decision tree with multiple categorical target variables, it essentially applies a compound condition to evaluate the best split at each node. With one target variable, it looks for the partitioning of the feature space that best separates the data into two groups, as measured by (for example) the Gini impurity or information entropy of that target. With multiple targets, it looks for the split in feature space that gives the best average Gini/entropy improvement across all of the targets.

This has the advantage that adding more targets can actually make your process more robust to over-fitting, as the decision tree algorithm has to find a partitioning of your feature space which separates multiple target variables simultaneously. The more target variables a single partitioning of the data gives you information about, the less likely it is to be a fluke. The flip-side of this is that if your target variables are very different, it's perhaps not reasonable to expect that the same partitioning in space should give you information about both simultaneously.
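If you want to try this in R, I believe `partykit::ctree` accepts a multivariate response on the left-hand side of the formula, and the `randomForestSRC` package fits multivariate forests. A sketch, again with the hypothetical `dat` from above:

```r
library(partykit)

# One conditional inference tree for both targets at once: each split is
# chosen so that it separates y1 and y2 jointly, not either one alone.
fit_tree <- ctree(y1 + y2 ~ x1 + x2 + x3 + x4, data = dat)

# A multivariate random forest via randomForestSRC's Multivar() syntax:
# library(randomForestSRC)
# fit_rf <- rfsrc(Multivar(y1, y2) ~ ., data = dat)
```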

Similarly, the hidden layers of neural networks take your raw features and try to learn a non-linear mapping to a new set of features which are the "right" features to feed to a logistic regression. A multiple-target neural network differs from a standard neural network only in that the one non-linear mapping it learns must produce a set of features that gets passed to a separate logistic regression for each target. This is analogous to the decision tree discussed above: it's more robust to over-fitting, because the "new" features the network generates are less likely to be a fluke if they give meaningful results in multiple logistic regression models across a range of targets, but likewise it might not be reasonable to expect very different target variables to be explainable with the same set of features.
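As a sketch of this in R (again with the hypothetical `dat`), as far as I know `nnet` lets you pass a matrix of targets, giving one logistic output unit per column on top of a shared hidden layer:

```r
library(nnet)

# Input matrix and a two-column 0/1 target matrix: one output unit
# (i.e. one logistic regression on the shared hidden features) per target.
X <- as.matrix(dat[, c("x1", "x2", "x3", "x4")])
Y <- cbind(y1 = as.numeric(dat$y1) - 1,   # factor levels -> 0/1
           y2 = as.numeric(dat$y2) - 1)

# One hidden layer learns the shared non-linear feature mapping;
# entropy = TRUE fits a sigmoid/cross-entropy output for each column of Y.
fit_nn <- nnet(X, Y, size = 5, entropy = TRUE, maxit = 500)

# A matrix of predicted probabilities, one column per target.
p <- predict(fit_nn, X)
```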

In conclusion, you have three options:

  1. Use two separate models

    Pros: Easy, allows you to learn features specific to each target

    Cons: Assumes targets are independent of one another, so does not learn correlations between target variables

  2. Create a compound target from the two targets, with one category for every possible combination of the two

    Pros: Best way to capture correlations between target variables

    Cons: Can quickly get very sparse if either (or both) targets have high cardinality. Probably a good solution if you have enough data, but you'll need sufficient numbers of examples of every possible value of this compound target.

  3. Train a single model with two targets simultaneously (e.g. a random forest or neural network)

    Pros: Forces the model to learn meaningful features, and is thus the most robust to over-fitting. The code is also easiest to keep track of, as you have one model

    Cons: If the target variables are very different, you are likely to have a much worse training loss than with either of the other two suggestions, as the model has to compromise and learn one representation that determines both target variables

Note that the third method is the most robust to over-fitting but is likely to have the worst training loss, so you'll need to cross-validate to determine which approach is best for your particular use case.
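A hedged sketch of such a comparison, using 5-fold cross-validation and the joint log-loss of the two separate logistic regressions from above (swap in the other two approaches the same way and compare the resulting numbers):

```r
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))

# Binary cross-entropy for predicted probabilities p against 0/1 labels y.
logloss <- function(p, y) -mean(y * log(p) + (1 - y) * log(1 - p))

cv_loss <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  # Approach 1: two separate logistic regressions.
  f1 <- glm(y1 ~ x1 + x2 + x3 + x4, data = train, family = binomial)
  f2 <- glm(y2 ~ x1 + x2 + x3 + x4, data = train, family = binomial)
  logloss(predict(f1, newdata = test, type = "response"), as.numeric(test$y1) - 1) +
    logloss(predict(f2, newdata = test, type = "response"), as.numeric(test$y2) - 1)
})
mean(cv_loss)  # compare this number across the three approaches
```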
