The multi-label classifier accepts a binary indicator matrix over multiple labels. So, for example, you could do something like this:
data = [
[[0.1 , 0.6, 0.0, 0.3], 1, 10, 0, 0, 0],
[[0.7 , 0.3, 0.0, 0.0], 0, 7, 22, 0, 0],
[[0.0 , 0.0, 0.6, 0.4], 0, 0, 6, 0, 20],
#...
]
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

X = np.array([d[1:] for d in data])
yvalues = np.array([d[0] for d in data])

# Convert the label lists into a binary indicator matrix of 0s and 1s
Y = MultiLabelBinarizer().fit_transform(yvalues)

clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X, Y)
clf.predict(X)  # in practice, predict on new data rather than the training X
The result for each prediction will be an array of 0s and 1s marking which class labels apply to each input sample.
Given your data, though, I'm not sure this is what you want to do. For example, the third point lists zero twice, which makes me think you're not predicting multiple labels in an unordered OneVsRest manner, but actually predicting multiple ordered columns of labels. In that case, it might make sense to fit a separate classifier for each column, e.g.
X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])

# Fit one classifier per label column, then stack the column predictions
clfs = [SVC().fit(X, Y[:, i]) for i in range(Y.shape[1])]
Ypred = np.array([clf.predict(X) for clf in clfs]).T
With other classifiers, such as RandomForestClassifier, you can do this column-by-column prediction in one operation, e.g.
from sklearn.ensemble import RandomForestClassifier

X = np.array([d[1:] for d in data])
Y = np.array([d[0] for d in data])
RandomForestClassifier().fit(X, Y).predict(X)  # handles multi-output Y directly
Of course, the array passed to predict should in practice be different from the array passed to fit, but hopefully this makes the distinction clear.
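For instance, here is a minimal sketch of that workflow using scikit-learn's train_test_split, reusing X, Y, and RandomForestClassifier from above (the 25% test fraction is just an illustrative choice):
from sklearn.model_selection import train_test_split

# Hold out a portion of the data that the model never sees during fitting
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)

clf = RandomForestClassifier()
clf.fit(X_train, Y_train)      # learn from the training portion only
Ypred = clf.predict(X_test)    # evaluate on the held-out portion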
First of all, your features are not word frequencies; they are just counts of each word type in the entire document, I assume. The word frequency (usually called "term frequency" instead) of a word is the number of times it occurs in the text divided by the total number of words in that text. For example, a word appearing 3 times in a 100-word document has a term frequency of 0.03.
Usually term frequency is a good feature for text classification. However, row normalization in your case returns the true term frequencies only if every word type is represented as a feature. Otherwise you ignore all the other words, and you even get a division by zero for a document that contains none of the feature words.
In real text classification problems we always filter the features, both to exclude errata, articles, names, and the like, and to reduce the model's dimensionality.
That's why you should compute word frequencies by dividing by the document's total word count rather than by normalizing over the selected features. For most algorithms no further normalization is necessary, though specific preprocessing may be recommended for particular algorithms.
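For example, here is a minimal sketch of the difference (the counts and document lengths are made up purely for illustration):
import numpy as np

# Made-up raw counts: rows are documents, columns are the selected feature words
counts = np.array([[3., 1., 0.],
                   [0., 2., 5.],
                   [0., 0., 0.]])

# Total words in each document, including words that are not features
doc_lengths = np.array([100., 80., 50.])

# Correct term frequencies: divide by each document's total word count
tf = counts / doc_lengths[:, None]

# Row normalization instead divides by the sum over the feature words only,
# which overstates the frequencies and divides by zero for the third
# document, since it contains none of the feature words
row_sums = counts.sum(axis=1, keepdims=True)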
Best Answer
The part about normalizing across rows pops out at me. It's usual to normalize a feature (column) so that, having done this for each feature, the features will be on more comparable scales. Normalizing across rows probably won't make any physical sense, and I'm not sure I can see any situation where it would be justified. (Imagine mashing a person's height, weight, and blood pressure together.)
Even if you normalize only the columns, note: if you normalize all of your data and then split it into train/test sets, you will get unrealistically good test results. Your training data represents the data you have before you deploy your model, and your test data represents the data that arrives after deployment. By normalizing across this boundary, you allow information from the future (the test set) to leak into the present (the training set). That can't and won't happen in the real world.
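As a sketch of the right order of operations with scikit-learn (assuming a feature matrix X and labels y; StandardScaler is just one possible choice of scaler):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Fit the scaler on the training columns only...
scaler = StandardScaler().fit(X_train)

# ...then apply the same learned parameters to both splits, so no
# statistics from the test set leak into the training step
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)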