Solved – Should I normalize featurewise or samplewise

data preprocessing, machine learning, normalization

It might be a beginner question, but I'm not sure how to normalize my data.

Let's suppose I have an NxM matrix with N samples of M dimensions each. If I want to normalize my data, I can do it in two ways:

Samplewise: I take each sample and normalize its features so that they end up being a unit vector (L2 norm) or they sum to 1 (L1 norm).

Featurewise: I take each feature and normalize its values across all samples.
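
To make the two options concrete, here is a minimal NumPy sketch. The matrix `X` is made-up illustration data, and the choice of min-max and z-score for the featurewise case is an assumption, since "normalize across all samples" could mean either of those or something else.

```python
import numpy as np

# Hypothetical N x M data matrix: N = 3 samples (rows), M = 2 features (columns).
X = np.array([[3.0, 4.0],
              [1.0, 1.0],
              [6.0, 8.0]])

# Samplewise (row-wise): each row is rescaled using only its own values.
row_l2 = X / np.linalg.norm(X, ord=2, axis=1, keepdims=True)  # rows become unit vectors
row_l1 = X / np.abs(X).sum(axis=1, keepdims=True)             # rows sum to 1

# Featurewise (column-wise): each column is rescaled using all samples.
# Min-max and z-score are two common choices (assumed here, not from the question).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
col_minmax = (X - col_min) / (col_max - col_min)              # each column in [0, 1]
col_zscore = (X - X.mean(axis=0)) / X.std(axis=0)             # zero mean, unit variance
```

The only difference between the two families is the axis: `axis=1` computes the statistic per sample, `axis=0` per feature.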

The problem I see is that in both cases I will end up losing some relationship information.

Let's see an example:

              Height    Arm_length
Subject_1       180      20
Subject_2       190      40

If I normalize rowwise:

                 Height          Arm_length
Subject_1       180/200 = 0.9   20/200 = 0.1
Subject_2       190/230 = 0.83  40/230 = 0.17

Here you can see that even though Subject_1 is shorter than Subject_2, after normalization Subject_1 ends up with the larger height value (since my normalization is independent between samples).

If I normalize columnwise:

                 Height          Arm_length
Subject_1       180/370 = 0.49   20/60 = 0.33
Subject_2       190/370 = 0.51   40/60 = 0.67

Here I can see that even though Subject_2 has a much lower value for arm_length than for height, after normalization it ends up with a higher value for arm_length than for height (0.67 vs 0.51).

Also, by normalizing I lose the absolute values and end up with only relationships.

Imagine a system that depends not only on the absolute height and arm_length but also on the relationship between them.

So basically my question is: Should I normalize at all? If yes, columnwise, or rowwise?

Also, would it be a good idea to normalize both ways and append both into a new 2*M dimensional feature vector?
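
For reference, the combined representation I have in mind could be built roughly like this. It is only a sketch using the sum-based normalization from the example above; whether the extra columns actually help would depend on the model.

```python
import numpy as np

# Rows: Subject_1, Subject_2. Columns: Height, Arm_length.
X = np.array([[180.0, 20.0],
              [190.0, 40.0]])

row_norm = X / X.sum(axis=1, keepdims=True)  # samplewise: each row sums to 1
col_norm = X / X.sum(axis=0, keepdims=True)  # featurewise: each column sums to 1

# Append both versions side by side: N samples, 2*M features.
combined = np.hstack([row_norm, col_norm])
print(combined.shape)  # (2, 4)
```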

EDIT:

The relationship between features is definitely important.
Imagine a system where different body shapes behave differently; in such a case, the relation between the Chest feature and the Waist feature will be extremely important.

By normalizing featurewise I'll lose this relationship.

Thanks

Best Answer

In business, data is mostly normalized feature-wise, as the aim is to study relationships across samples and to predict well for new samples. However, if your question aims at understanding relationships across features (which I haven't encountered yet), it would be a different scenario.

To classify people by their height to arm length ratio, I would suggest introducing a new feature, 'height to arm length ratio', before normalization or standardization (you can find the mathematical formulas at https://stats.stackexchange.com/a/10298), and then proceeding.
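
A minimal sketch of that suggestion, using the two subjects from the question and assuming z-score standardization (any other feature-wise scaling would slot in the same way). The variable names are illustrative only.

```python
import numpy as np

# Columns: Height, Arm_length (values from the question's example).
X = np.array([[180.0, 20.0],
              [190.0, 40.0]])

# Compute the ratio from the RAW values, before any scaling,
# so the cross-feature relationship is preserved as its own column.
ratio = X[:, 0] / X[:, 1]
X_aug = np.column_stack([X, ratio])  # shape (N, M + 1)

# Feature-wise standardization of every column, including the new ratio.
mean = X_aug.mean(axis=0)
std = X_aug.std(axis=0)
X_std = (X_aug - mean) / std
```

When predicting for new samples, the same `mean` and `std` computed on the training data would be reused rather than recomputed.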

Hope this helps!
