Scikit-learn's multi-label estimators accept a binary mask over multiple labels. So, for example, you could do something like this:
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Each row holds a list of labels followed by the feature values
data = [
    [[0.1, 0.6, 0.0, 0.3], 1, 10, 0, 0, 0],
    [[0.7, 0.3, 0.0, 0.0], 0, 7, 22, 0, 0],
    [[0.0, 0.0, 0.6, 0.4], 0, 0, 6, 0, 20],
    # ...
]
X = np.array([d[1:] for d in data])
yvalues = np.array([d[0] for d in data])

# Create a binary indicator array marking which labels apply to each sample
Y = MultiLabelBinarizer().fit_transform(yvalues)

clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X, Y)
clf.predict(X)  # in practice, predict on new data rather than the training X
The result of each prediction will be an array of 0s and 1s marking which class labels apply to each input sample.
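To make that binary mask concrete, here is what MultiLabelBinarizer produces for the label lists above; note how the duplicated 0.0 in the third row collapses to a single column, since the binarizer treats each list as an unordered set of labels:
from sklearn.preprocessing import MultiLabelBinarizer

labels = [[0.1, 0.6, 0.0, 0.3], [0.7, 0.3, 0.0, 0.0], [0.0, 0.0, 0.6, 0.4]]
print(MultiLabelBinarizer().fit_transform(labels))
# [[1 1 1 0 1 0]    columns correspond to the sorted label values:
#  [1 0 1 0 0 1]    0.0, 0.1, 0.3, 0.4, 0.6, 0.7
#  [1 0 0 1 1 0]]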
Given your data, though, I'm not sure this is what you want to do. For example, the third point has zero listed twice in its list of labels, which makes me think that you're not predicting multiple labels in an unordered OneVsRest manner, but actually predicting multiple ordered columns of labels. In that case, it might make sense to do a separate classification for each column, e.g.
X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
# Fit one classifier per column of labels (the floats here stand in for
# discrete class labels; scikit-learn classifiers expect discrete targets)
clfs = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]
# Stack the per-column predictions back into one (n_samples, n_columns) array
Ypred = np.array([clf.predict(X) for clf in clfs]).T
With other classifiers, such as RandomForestClassifier, you can do this column-by-column prediction in one operation, e.g.
from sklearn.ensemble import RandomForestClassifier

X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
# Random forests natively support multi-output targets (one column per label)
RandomForestClassifier().fit(X, Y).predict(X)
Of course, the array passed to predict should be different from the array passed to fit, but hopefully this makes the distinction clear.
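For example (a minimal sketch; the split fraction and random seed are arbitrary choices, not from your question):
from sklearn.model_selection import train_test_split

# Hold out a test set so that predict sees samples the model was not fitted on
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
Ypred = RandomForestClassifier().fit(X_train, Y_train).predict(X_test)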
This is not the only way, and it may not work for all problems, but one solution would be to compare the performance of a range of class numbers (the current number and one more, or the current number and one either side, or two either side; how many class counts you explore at each update depends on how much computational effort you can spare) and use an information criterion, e.g. the corrected Akaike Information Criterion (AICc), to assess the goodness of fit of each alternative. The model with the lowest AICc is the 'best' fit, although trivial differences (a delta-AICc smaller than about 5-10) are not sufficient to conclude that either model is substantially better. You could go one step further and calculate relative likelihoods for the different alternatives using Akaike weights.
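As a minimal sketch of those calculations (the function and variable names here are illustrative, not from a particular library; log_lik is a candidate model's maximised log-likelihood, k its number of parameters, and n the sample size):
import numpy as np

def aicc(log_lik, k, n):
    # AIC = 2k - 2 ln(L), plus the small-sample correction term
    return 2 * k - 2 * log_lik + (2 * k * (k + 1)) / (n - k - 1)

def akaike_weights(aicc_values):
    # Relative likelihood exp(-delta/2) of each model, normalised to sum to 1
    deltas = np.asarray(aicc_values) - np.min(aicc_values)
    rel_lik = np.exp(-deltas / 2)
    return rel_lik / rel_lik.sum()

# e.g. scores = [aicc(ll, k, n) for ll, k in zip(log_liks, param_counts)]
#      weights = akaike_weights(scores)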
I'd recommend taking a look at Burnham and Anderson (2002) "Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach".
Best Answer
Definitions.
In a classification task, your goal is to learn a mapping $h: X\rightarrow Y$ (with your favourite ML algorithm, e.g. CNNs). We make two common distinctions: binary vs. multi-class (are there two possible classes, or more than two?) and single-label vs. multi-label (is each sample assigned exactly one class, or possibly several?).
Effect on network architecture. The first distinction determines the number of output units (i.e., the number of neurons in the final layer). The second distinction determines which activation function for the final layer, and which loss function, you should use. For single-label, the standard choice is softmax with categorical cross-entropy; for multi-label, switch to sigmoid activations with binary cross-entropy. See here for a more detailed discussion of this question.
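As a concrete sketch of those two pairings (using PyTorch purely for illustration; the feature width, batch size, and number of labels below are made-up values):
import torch
import torch.nn as nn

k = 4                                  # hypothetical number of classes/labels
head = nn.Linear(128, k)               # final layer: k output units either way
logits = head(torch.randn(8, 128))     # a made-up batch of 8 feature vectors

# Single-label: softmax + categorical cross-entropy
# (CrossEntropyLoss applies log-softmax internally, so it takes raw logits)
y_single = torch.randint(0, k, (8,))   # exactly one class index per sample
loss_single = nn.CrossEntropyLoss()(logits, y_single)

# Multi-label: sigmoid + binary cross-entropy, one independent decision per label
# (BCEWithLogitsLoss applies the sigmoid internally)
y_multi = torch.randint(0, 2, (8, k)).float()  # 0/1 indicator per label
loss_multi = nn.BCEWithLogitsLoss()(logits, y_multi)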
Creating "hybrid" combinations. I'll describe an example similar to the one in your question. Suppose I'm trying to classify animals, and I'm interested in recognizing the following:
This looks confusing: some of the labels are mutually exclusive (an animal can't be both black and orange) and others aren't (it can be a black dog). In this case, the solution is to perform multi-class classification with $k=3\cdot 3=9$ outputs (or generally, the sum of the category sizes; here all three categories have the same size, 3). You just have to define the loss function carefully: apply a softmax activation to each group of 3 outputs (one group per category) and compare it to the true label for that category. I created a little sketch which I think makes it clear:
So the final loss is $L(\hat y, y)=CE_{species}+CE_{color}+CE_{size}$, one cross-entropy term per category. The entire idea here is that we exploited information about the structure of the labels (which are mutually exclusive and which aren't) to significantly reduce the number of outputs: from an exponential number (all combinations, in this case $3^3=27$) to a multiplicative one ($3\cdot 3=9$).
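Here is a minimal sketch of that grouped loss (again PyTorch for illustration; the layer sizes and batch are made up, and the three groups stand for species, color, and size):
import torch
import torch.nn as nn

# 3 categories with 3 labels each -> 9 output units in the final layer
head = nn.Linear(128, 9)
logits = head(torch.randn(8, 128)).view(8, 3, 3)  # (batch, category, label)

targets = torch.randint(0, 3, (8, 3))  # one true class index per category

# Softmax + cross-entropy applied within each category, then summed:
# L = CE_species + CE_color + CE_size
ce = nn.CrossEntropyLoss()
loss = sum(ce(logits[:, g, :], targets[:, g]) for g in range(3))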