A multi-label classifier accepts a binary mask over multiple labels. So, for example, you could do something like this:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

data = [
    [[0.1, 0.6, 0.0, 0.3], 1, 10, 0, 0, 0],
    [[0.7, 0.3, 0.0, 0.0], 0, 7, 22, 0, 0],
    [[0.0, 0.0, 0.6, 0.4], 0, 0, 6, 0, 20],
    # ...
]

X = np.array([d[1:] for d in data])  # the integer columns are the features
yvalues = [d[0] for d in data]       # each sample's collection of labels

# Create a binary indicator array marking which labels apply to each sample
Y = MultiLabelBinarizer().fit_transform(yvalues)

clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X, Y)
clf.predict(X)  # in practice, predict on new data rather than the training X
The result of each prediction will be an array of 0s and 1s marking which class labels apply to each input sample.
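To illustrate (a minimal sketch with made-up label sets, float-valued as in the question): MultiLabelBinarizer maps each sample's collection of labels onto a fixed set of indicator columns, one per distinct label.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# Two samples, each carrying a set of (float) labels
print(mlb.fit_transform([{0.1, 0.6}, {0.0, 0.7}]))
# [[0 1 1 0]
#  [1 0 0 1]]
print(mlb.classes_)  # [0.  0.1 0.6 0.7] -- one column per distinct label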
Given your data, though, I'm not sure this is what you want to do. For example, the third point has zero listed twice, which makes me think that you're not predicting multiple labels in an unordered OneVsRest manner, but actually predicting multiple ordered columns of labels. In that case, it might make sense to fit a separate classifier for each column, e.g.
X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])  # one column per ordered label

# Fit one SVC per label column, then stack the per-column predictions
clfs = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]
Ypred = np.array([clf.predict(X) for clf in clfs]).T
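As an aside, if I'm not mistaken, scikit-learn's MultiOutputClassifier wraps exactly this fit-one-classifier-per-column pattern, so the loop above can be written as:
from sklearn.multioutput import MultiOutputClassifier
Ypred = MultiOutputClassifier(SVC()).fit(X, Y).predict(X)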
With other classifiers, such as RandomForestClassifier, you can do this column-by-column prediction in one operation, e.g.
from sklearn.ensemble import RandomForestClassifier

X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])

# Random forests support multi-output targets natively, so no per-column loop
RandomForestClassifier().fit(X, Y).predict(X)
Of course, the array passed to predict should be different from the array passed to fit, but hopefully this makes the distinction clear.
Note that there is a subtle but important difference between multilabel problems, in which each instance may belong to several classes at once, and multiclass problems, in which each instance belongs to exactly one of $\geq 2$ classes. I will discuss both briefly, but based on the question I suspect you are referring to multiclass problems.
Multilabel problems can essentially be broken down into sets of binary problems without much loss of information. The only situation where a true multilabel formulation has an advantage, at least in theory, is when some combinations of labels are simply impossible, a constraint you cannot enforce cleanly in a set of binary learning problems.
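To make the decomposition concrete, here is a minimal binary-relevance sketch (the function names and the choice of base learner are mine):
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X, Y):
    # Y is an (n_samples, n_labels) binary indicator matrix;
    # fit one independent binary classifier per label column.
    return [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(clfs, X):
    # Note: nothing here can forbid impossible label combinations --
    # the one case where a true multilabel formulation has an edge.
    return np.column_stack([clf.predict(X) for clf in clfs])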
Multiclass problems, by contrast, are often best not split up into a set of binary problems, because some information may be lost. Of course, this only applies if the learning technique has a natural multiclass formulation (e.g., SVM does not, but neural networks and decision trees do). One of the big benefits of pure multiclass classification is that you typically have more data to learn from, which allows better discrimination between subtly distinct classes. Additionally, there is often an obvious computational advantage: a multiclass formulation contains no redundancy, in contrast to sets of binary formulations (regardless of your binarization scheme, be it 1-vs-1, 1-vs-all, ...).
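As an illustration of "natural multiclass formulation" versus binarization (a sketch on toy data, not a benchmark):
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Toy data with k = 3 classes (the values are placeholders)
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 2.], [2., 1.]])
y = np.array([0, 0, 1, 1, 2, 2])

# A decision tree handles all k classes in a single model, trained once
# on the full data set.
tree = DecisionTreeClassifier().fit(X, y)

# SVM lacks a natural multiclass formulation, so it must be reduced to
# binary problems one way or another:
ovr = OneVsRestClassifier(SVC()).fit(X, y)  # k binary problems
ovo = OneVsOneClassifier(SVC()).fit(X, y)   # k*(k-1)/2 binary problems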
Best Answer
This is the best visualization I can offer to describe multi-label KNN. Let me know if you disagree.
In the plot below, each individual carries one or more of the labels {blue, orange, green}. As you can see, some individuals are both blue and orange, and some green and orange. For the test subject indicated by the red arrow, the 7 nearest neighbors are probed.
Examining those 7 nearest neighbors yields the histogram below and a final class ranking of Blue = Orange > Green, meaning this test subject is blue or orange before it is green. I don't know precisely how this translates to class probabilities; I'd love to learn more.
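For what it's worth, here is a minimal sketch of that procedure as I read it (the names are mine, and the counts-over-k score at the end is only the simplest stand-in for a per-label probability, not a calibrated one):
import numpy as np
from sklearn.neighbors import NearestNeighbors

def multilabel_knn_scores(X_train, Y_train, x_test, k=7):
    # Y_train is an (n_samples, n_labels) binary indicator matrix.
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x_test.reshape(1, -1))
    counts = Y_train[idx[0]].sum(axis=0)  # the histogram over the k neighbors
    ranking = np.argsort(-counts)         # e.g. Blue = Orange > Green
    return ranking, counts / k            # crude per-label scores in [0, 1]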