Solved – Feature construction and normalization in machine learning

feature-engineeringmachine learning

Lets say I want to create a Logistic Classifier for a movie M.
My features would be something like age of the person, gender, occupation, location.
So training set would be something like:

  • Age Gender Occupation Location Like(1)/Dislike(0)
  • 23 M Software US 1
  • 24 F Doctor UK 0

and so on….
Now my question is how should I scale and represent my features.
One way I thought:
Divide age as age groups, so 18-25, 25-35, 35-above, Gender as M,F, Location as US, UK, Others. Now create a binary feature for all these values, hence age will be having 3 binary features each corresponding to an age group and so on. So, a 28 years Male from US would be represented as 010 10 100 (010-> Age Group 25-35, 10 -> Male, 100 -> US)

What could be the best way to represent features here ?
Also, I noticed in some e.gs. of sklearn that all the features have been scaled/normalized in some way, e.g. Gender is represented by two values, 0.0045 and -.0.0045 for Male and female. I don't have any clue on how to do scaling/mormalization like this ?

Best Answer

Binary case

If you want your features to be binary, the good representations for categorical (resp. real) values are the one hot (resp. thermometer) encoding. You don't need to normalize them.

For the one hot encoding of a categorical feature, you simply reserve one bit for each class. The length of this encoding is therefore the number of classes of your feature. Lets take your example of country,

  • 00001 for US
  • 00010 for UK
  • 00100 for Asia
  • 01000 for Europe
  • 10000 for other

For the thermometer encoding of a real/integer feature, you have to choose a length and the thresholds. For your example of age, you've chosen to split age according to the thresholds 18,25 and 35. The coding will be

  • 000 for 0-17
  • 001 for 18-25
  • 011 for 25-34
  • 111 for 35-above

Putting both together, you obtain here an encoding of size 5+3=8 bits. For a 30 year old UK resident we have $$\overbrace{0 \cdot 0 \cdot 0 \cdot 1 \cdot 0}^{UK}\cdot \overbrace{0 \cdot 1 \cdot 1 }^{30yo}$$

Continuous case

If your regression model allows it, you should prefere to keep a real value for a real/integer feature which contains more information. Let's reconsider your example. This time we simply let the value for age as an integer. The encoding for a 30 year old UK resident is thus $$\overbrace{0 \cdot 0 \cdot 0 \cdot 1 \cdot 0 }^{UK}\cdot \overbrace{30 }^{30yo}$$

As BGreene said, you should then normalize this value to keep a mean of 0 and a standard deviation of 1, which insure stability of many regression models. In order to do that, simply subtract the empirical mean and divide by the empirical standard deviation.

Y_normalized = ( Y - mean(Y) ) / std(Y)

If the mean of all the age of all persons in your data base is 25, and its standard deviation is 10, the normalized value for a 30y.o. person will be $(30-25)/10 = 0.5$, leading to the representation $$\overbrace{0 \cdot 0 \cdot 0 \cdot 1 \cdot 0}^{UK}\cdot \overbrace{0.5 }^{30yo}$$

Related Question