Solved – Can target encoding be performed on a multi-label classification problem?

categorical-data, categorical-encoding, dimensionality-reduction, machine-learning, multilabel

Is there a way to perform target encoding on multi-label (closed-set) problems? Target encoding is obviously used on multi-class problems all the time, but I'm wondering whether it works for multi-label problems as well, and whether there are any specialized methods for it.

A naive approach would be to create a dataset with duplicate rows, one copy per label a row contains, each with a different target value, and then perform target encoding on that, but I'm not sure how effective this would be.
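A minimal sketch of that naive approach, using hypothetical toy data (the feature values and label names are made up for illustration): each row is duplicated once per label it carries, and the expanded data is then target-encoded exactly as in the multi-class case, i.e. each category is replaced by its empirical label distribution.

```python
from collections import defaultdict

# Hypothetical toy data: one categorical feature and a set of labels per row.
rows = [
    ("red",  {"cat", "dog"}),
    ("red",  {"dog"}),
    ("blue", {"cat"}),
    ("blue", {"fish"}),
]

# Step 1: duplicate each row once per label it contains.
expanded = [(cat, label) for cat, labels in rows for label in labels]

# Step 2: ordinary multi-class target encoding on the expanded data:
# count label occurrences per category, then normalize to a distribution.
counts = defaultdict(lambda: defaultdict(int))
for cat, label in expanded:
    counts[cat][label] += 1

encoding = {
    cat: {label: n / sum(label_counts.values())
          for label, n in label_counts.items()}
    for cat, label_counts in counts.items()
}
# e.g. "red" appears in 3 expanded rows (cat once, dog twice),
# so encoding["red"] is {"cat": 1/3, "dog": 2/3}.
```

Note that the duplication skews the distribution toward rows with many labels, which may or may not be what you want.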

Suggestions for alternatives to target encoding are also appreciated; I just want to reduce the mess caused by the massive cardinality of my categorical variables.

Best Answer

The approach you describe might work and is worth trying.

The other obvious approach that I am aware of is to have a separate target encoding for each label (i.e. instead of one extra column, creating several). Strictly, (number of labels) − 1 variables might suffice (e.g. in binary classification with 2 classes, one variable is enough, since the other class is implicitly "not the first"), but it is worth also trying the full number of labels, as otherwise the model needs to learn to sum the other encodings up.
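The per-label approach can be sketched as follows, again on hypothetical toy data: since multi-label classification is a set of binary problems, each label gets its own binary target encoding, namely the fraction of rows with that category value that carry the label.

```python
from collections import defaultdict

# Hypothetical toy data: one categorical feature and a set of labels per row.
rows = [
    ("red",  {"cat", "dog"}),
    ("red",  {"dog"}),
    ("blue", {"cat"}),
    ("blue", {"fish"}),
]
labels = sorted({l for _, ls in rows for l in ls})  # ["cat", "dog", "fish"]

# For each (category, label) pair, estimate P(label present | category):
# one binary target encoding per label, giving one new column per label.
totals = defaultdict(int)
hits = defaultdict(lambda: defaultdict(int))
for cat, ls in rows:
    totals[cat] += 1
    for l in ls:
        hits[cat][l] += 1

encoded = {
    cat: [hits[cat][l] / totals[cat] for l in labels]
    for cat in totals
}
# encoded["red"] is [0.5, 1.0, 0.0]: half of the "red" rows carry "cat",
# all of them carry "dog", none carry "fish".
```

In practice you would also want smoothing toward the global label frequencies (as in standard target encoding) to avoid overfitting rare categories.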

Since multi-label classification (as opposed to multi-class) is rather hard for non-neural-network models, I guess you'll be using a neural network. If so, there are some things you can do quite easily in neural networks that might help it along: for each output, you could explicitly include a regression equation over the target encoding (or encodings, if you have multiple variables) for that label, plus inputs from the rest of the network (to which you might feed the target encoding as an input as well). Of course, if your categorical variables have many categories, then embeddings are a super-attractive alternative for representing categories in neural networks (you can try one, the other, or both at the same time). And, in general, it can be worth using the embeddings a neural network has learnt as features for other model types.
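To make the embedding alternative concrete, here is a bare NumPy sketch of what an embedding layer (e.g. `nn.Embedding` in PyTorch or `tf.keras.layers.Embedding`) does; the vocabulary and dimension below are arbitrary illustrative choices. Each category is mapped to an integer index into a trainable lookup table, so the per-category vector is learnt end-to-end instead of being derived from the targets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary: map each category to an integer index.
vocab = {"red": 0, "blue": 1, "green": 2}
embedding_dim = 4

# An embedding layer is just a trainable lookup table, one row per category.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(categories):
    """Replace each category with its current dense vector."""
    idx = np.array([vocab[c] for c in categories])
    return embedding_table[idx]  # shape: (batch_size, embedding_dim)

batch = embed(["red", "blue", "red"])
# batch has shape (3, 4); during training, backpropagation would update
# exactly the rows of embedding_table that were looked up.
```

Unlike target encoding, the table rows are not statistics of the labels but free parameters, so they can capture whatever structure helps the downstream layers, and they can afterwards be exported as features for other model types, as mentioned above.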
