A multi-label classifier accepts a binary mask over multiple labels. So, for example, you could do something like this:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

data = [
    [[0.1, 0.6, 0.0, 0.3], 1, 10, 0, 0, 0],
    [[0.7, 0.3, 0.0, 0.0], 0, 7, 22, 0, 0],
    [[0.0, 0.0, 0.6, 0.4], 0, 0, 6, 0, 20],
    # ...
]

X = np.array([d[1:] for d in data])  # the integer columns are the features
yvalues = [d[0] for d in data]       # each sample's collection of labels

# Create a binary indicator array marking which labels apply to each sample
Y = MultiLabelBinarizer().fit_transform(yvalues)

clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X, Y)
clf.predict(X)  # in practice, predict on new data rather than the training X
The result of each prediction will be an array of 0s and 1s marking which class labels apply to each input sample.
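To illustrate (a minimal sketch with made-up label sets, float-valued as in the question): MultiLabelBinarizer maps each sample's collection of labels onto a fixed set of indicator columns, one per distinct label.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# Two samples, each carrying a set of (float) labels
print(mlb.fit_transform([{0.1, 0.6}, {0.0, 0.7}]))
# [[0 1 1 0]
#  [1 0 0 1]]
print(mlb.classes_)  # [0.  0.1 0.6 0.7] -- one column per distinct label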
Given your data, though, I'm not sure this is what you want to do. For example, the third point has zero listed twice, which makes me think that you're not predicting multiple labels in an unordered OneVsRest manner, but actually predicting multiple ordered columns of labels. In that case, it might make sense to fit a separate classifier for each column, e.g.
X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])  # one column per ordered label

# Fit one SVC per label column, then stack the per-column predictions
clfs = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]
Ypred = np.array([clf.predict(X) for clf in clfs]).T
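As an aside, if I'm not mistaken, scikit-learn's MultiOutputClassifier wraps exactly this fit-one-classifier-per-column pattern, so the loop above can be written as:
from sklearn.multioutput import MultiOutputClassifier
Ypred = MultiOutputClassifier(SVC()).fit(X, Y).predict(X)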
With other classifiers, such as RandomForestClassifier, you can do this column-by-column prediction in one operation, e.g.
from sklearn.ensemble import RandomForestClassifier

X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])

# Random forests support multi-output targets natively, so no per-column loop
RandomForestClassifier().fit(X, Y).predict(X)
Of course, the array passed to predict should be different from the array passed to fit, but hopefully this makes the distinction clear.
Note that there is a subtle but important difference between multilabel problems, in which each instance may belong to several classes at once, and multiclass problems, in which each instance belongs to exactly one of $\geq 2$ classes. I will discuss both briefly, but based on the question I suspect you are referring to multiclass problems.
Multilabel problems can essentially be broken down into sets of binary problems without much loss of information. The only situation where a true multilabel formulation has an advantage, at least in theory, is when some combinations of labels are simply impossible, a constraint you cannot enforce cleanly in a set of binary learning problems.
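To make the decomposition concrete, here is a minimal binary-relevance sketch (the function names and the choice of base learner are mine):
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X, Y):
    # Y is an (n_samples, n_labels) binary indicator matrix;
    # fit one independent binary classifier per label column.
    return [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(clfs, X):
    # Note: nothing here can forbid impossible label combinations --
    # the one case where a true multilabel formulation has an edge.
    return np.column_stack([clf.predict(X) for clf in clfs])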
Multiclass problems, by contrast, are often best not split up into a set of binary problems, because some information may be lost. Of course, this only applies if the learning technique has a natural multiclass formulation (e.g., SVM does not, but neural networks and decision trees do). One of the big benefits of pure multiclass classification is that you typically have more data to learn from, which allows better discrimination between subtly distinct classes. Additionally, there is often an obvious computational advantage: a multiclass formulation contains no redundancy, in contrast to sets of binary formulations (regardless of your binarization scheme, be it 1-vs-1, 1-vs-all, ...).
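As an illustration of "natural multiclass formulation" versus binarization (a sketch on toy data, not a benchmark):
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Toy data with k = 3 classes (the values are placeholders)
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 2.], [2., 1.]])
y = np.array([0, 0, 1, 1, 2, 2])

# A decision tree handles all k classes in a single model, trained once
# on the full data set.
tree = DecisionTreeClassifier().fit(X, y)

# SVM lacks a natural multiclass formulation, so it must be reduced to
# binary problems one way or another:
ovr = OneVsRestClassifier(SVC()).fit(X, y)  # k binary problems
ovo = OneVsOneClassifier(SVC()).fit(X, y)   # k*(k-1)/2 binary problems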
Best Answer
This is the best visualization I can offer to describe multi-label KNN. Let me know if you disagree.
In the plot below, each individual carries one or more of the labels {blue, orange, green}. As you can see, some individuals are both blue and orange, and some green and orange. For the test subject indicated by the red arrow, the 7 nearest neighbors are probed.
Examining those 7 nearest neighbors yields the histogram below and a final class ranking of Blue = Orange > Green, meaning this test subject is blue or orange before it is green. I don't know precisely how this translates to class probabilities; I'd love to learn more.
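For what it's worth, here is a minimal sketch of that procedure as I read it (the names are mine, and the counts-over-k score at the end is only the simplest stand-in for a per-label probability, not a calibrated one):
import numpy as np
from sklearn.neighbors import NearestNeighbors

def multilabel_knn_scores(X_train, Y_train, x_test, k=7):
    # Y_train is an (n_samples, n_labels) binary indicator matrix.
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x_test.reshape(1, -1))
    counts = Y_train[idx[0]].sum(axis=0)  # the histogram over the k neighbors
    ranking = np.argsort(-counts)         # e.g. Blue = Orange > Green
    return ranking, counts / k            # crude per-label scores in [0, 1]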