Solved – Predict multiple outcome vectors at once (no multinomial or multiclass)

classification, multivariate analysis, r

Let's say I have a dataset where I need to predict two or more variables (classification). Please understand that I really don't mean multinomial or multiclass classification; it's about predicting multiple outcome vectors at once:

$x_1, x_2, x_3, x_4, \dots, x_n \rightarrow y_1, y_2$

Does anyone know what sort of approach is good for this (preferably in R)? To be honest, I've searched for answers a lot but couldn't find anything.

Best Answer

The decision you need to make is whether you want to use two separate models or one combined model. In general, a single model is better, as it takes care of correlations between $y_{1}$ and $y_{2}$, but it's harder and requires you to think about the problem more.
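For concreteness, here is a minimal sketch of the two-separate-models approach in R, assuming a hypothetical data frame `dat` with predictors `x1`–`x4` and binary factor targets `y1` and `y2`:

```r
# Two independent logistic regressions, one per target.
fit_y1 <- glm(y1 ~ x1 + x2 + x3 + x4, data = dat, family = binomial)
fit_y2 <- glm(y2 ~ x1 + x2 + x3 + x4, data = dat, family = binomial)

# Each target is predicted separately; any correlation
# between y1 and y2 is simply ignored.
p1 <- predict(fit_y1, newdata = dat, type = "response")
p2 <- predict(fit_y2, newdata = dat, type = "response")
```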

For example, as far as I know, logistic regression doesn't lend itself very naturally to predicting two separate target variables. If both of your target variables are binary, you could make a single target variable with four outcomes, one for each of the four possible combinations. This is probably a good approach, and it deals very naturally with correlations between your two targets, but obviously as the cardinality of either target increases, it doesn't scale very well.
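As a rough illustration of this compound-target idea (using the same hypothetical `dat` as above), you can build the four-level target with `interaction()` and fit a multinomial model, e.g. with `nnet::multinom`:

```r
library(nnet)  # provides multinom()

# Collapse the two binary targets into one four-level factor.
dat$y12 <- interaction(dat$y1, dat$y2, drop = TRUE)

# A single multinomial model over the four combinations, so correlations
# between y1 and y2 are captured in the joint class probabilities.
fit_joint <- multinom(y12 ~ x1 + x2 + x3 + x4, data = dat)

# Joint probabilities over the four combinations; marginal probabilities
# for y1 or y2 alone are obtained by summing the appropriate columns.
probs <- predict(fit_joint, newdata = dat, type = "probs")
```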

Models that lend themselves quite naturally to multiple targets tend to be models that "generate features". Two very different classes of model that can be interpreted as doing this are tree-based models (e.g. decision trees/random forests) and deep learning models.

If you train a decision tree with multiple categorical target variables, it essentially applies a compound condition to evaluate the best split at each node. With one target variable, it looks for the partitioning of the feature space that best separates the data into two groups, as measured by (for example) the Gini impurity or information entropy of that target. With multiple targets, it looks for the split in feature space that gives the best average Gini/entropy improvement across all of the targets.

This has the advantage that adding more targets can actually make your process more robust to over-fitting, as the decision tree algorithm has to find a partitioning of your feature space which separates multiple target variables simultaneously. The more target variables a single partitioning of the data gives you information about, the less likely it is to be a fluke. The flip-side of this is that if your target variables are very different, it's perhaps not reasonable to expect that the same partitioning in space should give you information about both simultaneously.
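If you want to try this in R, I believe `partykit::ctree` accepts a multivariate response on the left-hand side of the formula, and the `randomForestSRC` package fits multivariate forests. A sketch, again with the hypothetical `dat` from above:

```r
library(partykit)

# One conditional inference tree for both targets at once: each split is
# chosen so that it separates y1 and y2 jointly, not either one alone.
fit_tree <- ctree(y1 + y2 ~ x1 + x2 + x3 + x4, data = dat)

# A multivariate random forest via randomForestSRC's Multivar() syntax:
# library(randomForestSRC)
# fit_rf <- rfsrc(Multivar(y1, y2) ~ ., data = dat)
```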

Similarly, the hidden layers of neural networks take your raw features and try to learn a non-linear mapping to a new set of features which are the "right" features to feed to a logistic regression. A multiple-target neural network differs from a standard neural network only in that the one non-linear mapping it learns must produce a set of features that gets passed to a separate logistic regression for each target. This is analogous to the decision tree discussed above: it's more robust to over-fitting, because the "new" features the network generates are less likely to be a fluke if they give meaningful results in multiple logistic regression models across a range of targets, but likewise it might not be reasonable to expect very different target variables to be explainable with the same set of features.
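As a sketch of this in R (again with the hypothetical `dat`), as far as I know `nnet` lets you pass a matrix of targets, giving one logistic output unit per column on top of a shared hidden layer:

```r
library(nnet)

# Input matrix and a two-column 0/1 target matrix: one output unit
# (i.e. one logistic regression on the shared hidden features) per target.
X <- as.matrix(dat[, c("x1", "x2", "x3", "x4")])
Y <- cbind(y1 = as.numeric(dat$y1) - 1,   # factor levels -> 0/1
           y2 = as.numeric(dat$y2) - 1)

# One hidden layer learns the shared non-linear feature mapping;
# entropy = TRUE fits a sigmoid/cross-entropy output for each column of Y.
fit_nn <- nnet(X, Y, size = 5, entropy = TRUE, maxit = 500)

# A matrix of predicted probabilities, one column per target.
p <- predict(fit_nn, X)
```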

In conclusion, you have three options:

  1. Use two separate models

    Pros: Easy, allows you to learn features specific to each target

    Cons: Assumes targets are independent of one another, so does not learn correlations between target variables

  2. Create a compound target from the two targets, with one category for every possible combination of the two

    Pros: Best way to capture correlations between target variables

    Cons: Can quickly get very sparse if either (or both) targets have high cardinality. Probably a good solution if you have enough data, but you'll need sufficient numbers of examples of every possible value of this compound target.

  3. Train a single model with two targets simultaneously (e.g. a random forest or neural network)

    Pros: Forces the model to learn meaningful features, and is thus the most robust to over-fitting. The code is also easiest to keep track of, as you have one model

    Cons: If the target variables are very different, you are likely to have a much worse training loss than with either of the other two suggestions, as the model has to compromise and learn one representation that determines both target variables

Note that the third method is the most robust to over-fitting but is likely to have the worst training loss, so you'll need to cross-validate to determine which approach is best for your particular use case.
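A hedged sketch of such a comparison, using 5-fold cross-validation and the joint log-loss of the two separate logistic regressions from above (swap in the other two approaches the same way and compare the resulting numbers):

```r
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))

# Binary cross-entropy for predicted probabilities p against 0/1 labels y.
logloss <- function(p, y) -mean(y * log(p) + (1 - y) * log(1 - p))

cv_loss <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  # Approach 1: two separate logistic regressions.
  f1 <- glm(y1 ~ x1 + x2 + x3 + x4, data = train, family = binomial)
  f2 <- glm(y2 ~ x1 + x2 + x3 + x4, data = train, family = binomial)
  logloss(predict(f1, newdata = test, type = "response"), as.numeric(test$y1) - 1) +
    logloss(predict(f2, newdata = test, type = "response"), as.numeric(test$y2) - 1)
})
mean(cv_loss)  # compare this number across the three approaches
```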
