The multi-label classifier accepts a binary indicator matrix over multiple labels. So, for example, you could do something like this:
data = [
[[0.1 , 0.6, 0.0, 0.3], 1, 10, 0, 0, 0],
[[0.7 , 0.3, 0.0, 0.0], 0, 7, 22, 0, 0],
[[0.0 , 0.0, 0.6, 0.4], 0, 0, 6, 0, 20],
#...
]
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

X = np.array([d[1:] for d in data])
yvalues = np.array([d[0] for d in data])

# Convert the label lists into a binary indicator matrix of 0s and 1s
Y = MultiLabelBinarizer().fit_transform(yvalues)

clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X, Y)
clf.predict(X)  # in practice, predict on new data rather than the training X
The result for each prediction will be an array of 0s and 1s marking which class labels apply to each input sample.
Given your data, though, I'm not sure this is what you want to do. For example, the third point lists zero twice, which makes me think you're not predicting multiple labels in an unordered OneVsRest manner, but actually predicting multiple ordered columns of labels. In that case, it might make sense to fit a separate classifier for each column, e.g.
X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])

# Fit one classifier per label column, then stack the column predictions
clfs = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]
Ypred = np.array([clf.predict(X) for clf in clfs]).T
With other classifiers, such as RandomForestClassifier, you can do this column-by-column prediction in one operation, e.g.
from sklearn.ensemble import RandomForestClassifier

X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
RandomForestClassifier().fit(X, Y).predict(X)  # handles multi-output Y directly
Of course, the array passed to predict should in practice be different from the array passed to fit, but hopefully this makes the distinction clear.
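For instance, here is a minimal sketch of that workflow using scikit-learn's train_test_split, reusing X, Y, and RandomForestClassifier from above (the 25% test fraction is just an illustrative choice):
from sklearn.model_selection import train_test_split

# Hold out a portion of the data that the model never sees during fitting
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)

clf = RandomForestClassifier()
clf.fit(X_train, Y_train)      # learn from the training portion only
Ypred = clf.predict(X_test)    # evaluate on the held-out portion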
First of all, your features are not word frequencies; they are just counts of each word type in the entire document, I assume. The word frequency (usually called "term frequency" instead) of a word is the number of times it occurs in the text divided by the total number of words in that text. For example, a word appearing 3 times in a 100-word document has a term frequency of 0.03.
Usually term frequency is a good feature for text classification. However, row normalization in your case returns the true term frequencies only if every word type is represented as a feature. Otherwise you ignore all the other words, and you even get a division by zero for a document that contains none of the feature words.
In real text classification problems we always filter the features, both to exclude errata, articles, names, and the like, and to reduce the model's dimensionality.
That's why you should compute word frequencies by dividing by the document's total word count rather than by normalizing over the selected features. For most algorithms no further normalization is necessary, though specific preprocessing may be recommended for particular algorithms.
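For example, here is a minimal sketch of the difference (the counts and document lengths are made up purely for illustration):
import numpy as np

# Made-up raw counts: rows are documents, columns are the selected feature words
counts = np.array([[3., 1., 0.],
                   [0., 2., 5.],
                   [0., 0., 0.]])

# Total words in each document, including words that are not features
doc_lengths = np.array([100., 80., 50.])

# Correct term frequencies: divide by each document's total word count
tf = counts / doc_lengths[:, None]

# Row normalization instead divides by the sum over the feature words only,
# which overstates the frequencies and divides by zero for the third
# document, since it contains none of the feature words
row_sums = counts.sum(axis=1, keepdims=True)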
Best Answer
The part about normalizing across rows pops out at me. It's usual to normalize a feature (column) so that, having done this for each feature, the features will be on more comparable scales. Normalizing across rows probably won't make any physical sense, and I'm not sure I can see any situation where it would be justified. (Imagine mashing a person's height, weight, and blood pressure together.)
Even if you normalize only the columns, note: if you normalize all of your data and then split it into train/test sets, you will get unrealistically good test results. Your training data represents the data you have before you deploy your model, and your test data represents the data that arrives after deployment. By normalizing across this boundary, you allow information from the future (the test set) to leak into the present (the training set). That can't and won't happen in the real world.
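As a sketch of the right order of operations with scikit-learn (assuming a feature matrix X and labels y; StandardScaler is just one possible choice of scaler):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Fit the scaler on the training columns only...
scaler = StandardScaler().fit(X_train)

# ...then apply the same learned parameters to both splits, so no
# statistics from the test set leak into the training step
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)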