Stratified sampling means that the class membership distribution is preserved in your KFold sampling. This doesn't make much sense in the multilabel case, where your target vector can have more than one label per observation.
There are two possible interpretations of stratified in this setting.
For $n$ labels where at least one of them is set, there are $\sum\limits_{i=1}^n\binom{n}{i} = 2^n - 1$ unique label combinations. You could perform stratified sampling over each of these unique combination bins.
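As a minimal sketch of this first interpretation (assuming scikit-learn is available; the encoding scheme is my own choice), you can map each label vector to a single integer code and run an ordinary StratifiedKFold over those codes:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

np.random.seed(1)
y = np.random.randint(0, 2, (5000, 5))
y = y[y.sum(axis=1) != 0]  # drop rows with no labels at all

# encode each binary label vector as one integer in [1, 2**5 - 1]
combo = y.dot(1 << np.arange(y.shape[1]))

# stratify on the combination codes; the X argument is only a placeholder
skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=1)
folds = [test for _, test in skf.split(np.zeros(len(y)), combo)]
```

Each fold then contains roughly the same proportion of every unique label combination. Note this only works when every combination occurs at least n_splits times, which quickly fails as $n$ grows.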
The other option is to segment the training data such that the probability mass of the distribution of the label vectors is approximately the same over the folds. E.g.
import numpy as np

np.random.seed(1)
y = np.random.randint(0, 2, (5000, 5))
y = y[np.where(y.sum(axis=1) != 0)[0]]  # drop rows with no labels at all


def proba_mass_split(y, folds=7):
    obs, classes = y.shape
    dist = y.sum(axis=0).astype('float')
    dist /= dist.sum()
    index_list = []
    fold_dist = np.zeros((folds, classes), dtype='float')
    for _ in range(folds):
        index_list.append([])
    for i in range(obs):
        if i < folds:
            # seed each fold with one observation
            target_fold = i
        else:
            # assign to the fold whose label distribution is
            # furthest below the overall distribution
            normed_folds = fold_dist.T / fold_dist.sum(axis=1)
            how_off = normed_folds.T - dist
            target_fold = np.argmin(np.dot((y[i] - .5).reshape(1, -1), how_off.T))
        fold_dist[target_fold] += y[i]
        index_list[target_fold].append(i)
    print("Fold distributions are")
    print(fold_dist)
    return index_list


if __name__ == '__main__':
    proba_mass_split(y)
To get the usual train/test index pairs that KFold produces, rewrite this so that it returns the np.setdiff1d of each index list with np.arange(y.shape[0]), then wrap that in a class with an __iter__ method.
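A sketch of that wrapper (the class name ProbaMassKFold is mine; proba_mass_split is the function from above, repeated without the prints so the snippet is self-contained):

```python
import numpy as np


def proba_mass_split(y, folds=7):
    # greedy fold assignment, as above
    obs, classes = y.shape
    dist = y.sum(axis=0).astype('float')
    dist /= dist.sum()
    index_list = [[] for _ in range(folds)]
    fold_dist = np.zeros((folds, classes), dtype='float')
    for i in range(obs):
        if i < folds:
            target_fold = i
        else:
            normed_folds = fold_dist.T / fold_dist.sum(axis=1)
            how_off = normed_folds.T - dist
            target_fold = np.argmin(np.dot((y[i] - .5).reshape(1, -1), how_off.T))
        fold_dist[target_fold] += y[i]
        index_list[target_fold].append(i)
    return index_list


class ProbaMassKFold:
    """Iterate over (train_indices, test_indices) pairs, KFold-style."""

    def __init__(self, y, folds=7):
        self.y = y
        self.folds = folds

    def __iter__(self):
        everything = np.arange(self.y.shape[0])
        for test in proba_mass_split(self.y, self.folds):
            test = np.array(test)
            yield np.setdiff1d(everything, test), test
```

Each iteration yields disjoint train and test index arrays whose union covers all observations, which is what downstream cross-validation loops expect.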
The Multi-label algorithm accepts a binary mask over multiple labels. So, for example, you could do something like this:
data = [
    [[0.1, 0.6, 0.0, 0.3], 1, 10, 0, 0, 0],
    [[0.7, 0.3, 0.0, 0.0], 0, 7, 22, 0, 0],
    [[0.0, 0.0, 0.6, 0.4], 0, 0, 6, 0, 20],
    # ...
]
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

X = np.array([d[1:] for d in data])
yvalues = np.array([d[0] for d in data])

# Create a binary array marking which label values occur in each sample
Y = MultiLabelBinarizer().fit_transform(yvalues)

clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X, Y)
clf.predict(X)  # in practice, predict on new data rather than the training X
The result of each prediction will be an array of 0s and 1s marking which class labels apply to each input row.
Given your data, though, I'm not sure this is what you want to do. For example, the third point has zero listed twice, which makes me think that you're not predicting multiple labels in an unordered OneVsRest manner, but actually predicting multiple ordered columns of labels. In that case, it might make sense to do a separate classification for each column, e.g.
X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
clfs = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]
Ypred = np.array([clf.predict(X) for clf in clfs]).T
With other classifiers, such as RandomForestClassifier, you can do this column-by-column prediction in one operation, e.g.
from sklearn.ensemble import RandomForestClassifier

X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
RandomForestClassifier().fit(X, Y).predict(X)
Of course, the array passed to predict should be different from the array passed to fit, but hopefully this makes the distinction clear.
I think binary relevance should still be done in the way you proposed.
For an overview of different methods for multilabel classification, see https://journal.r-project.org/archive/2017/RJ-2017-012/RJ-2017-012.pdf. I don't know how they could be adapted for multiclass multilabel problems; I would rather call those multitarget problems.
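As a sketch of what binary relevance looks like here (the toy data and the choice of LogisticRegression are mine), each label column simply gets its own independent binary classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy multilabel data: 100 samples, 4 features, 3 binary labels
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
scores = X @ rng.rand(4, 3)
Y = (scores > np.median(scores, axis=0)).astype(int)

# binary relevance: fit one independent binary classifier per label column
clfs = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]
Ypred = np.column_stack([clf.predict(X) for clf in clfs])
```

This ignores correlations between labels, which is exactly the limitation the methods surveyed below try to address.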