Feature Engineering – Machine Learning Feature Encoding Techniques

feature-engineering · machine-learning · missing-data

I'm new to Machine Learning.

I've just finished the Coursera course. 🙂

And for my first practical attempt I wanted to "analyse" a local used-car sales website in order to build a model that would "predict" the final sale price.

And I have a problem with "encoding" the car features:
Some of them are categorical (make, model; gearbox encoded as 1 – manual, 2 – automatic, 3 – semi-automatic; fuel encoded as 1 – petrol, 2 – diesel, 3 – electric; etc.),
some are continuous (engine volume, engine power, mileage, etc.).
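
For reference, here's a minimal sketch of how I'm representing the data at the moment (Python with pandas; the column names and values are just made-up examples):

```python
import pandas as pd

# A few toy listings; None marks a feature the seller left blank.
cars = pd.DataFrame({
    "gearbox": [1, 2, None, 3],   # 1 - manual, 2 - automatic, 3 - semi-automatic
    "fuel":    [1, 1, 2, None],   # 1 - petrol, 2 - diesel, 3 - electric
    "mileage": [120000.0, None, 45000.0, 80000.0],
    "price":   [4500.0, 7200.0, 9900.0, 6100.0],  # target to predict
})
print(cars)
```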

The issue is that some of these features might be absent, as it is not compulsory to fill them all in.

My main question is: should I use some special value for representing a missing feature?

I don't feel like using "0" (zero) would do any good: since "0 * x = 0", absolutely any "theta" would fit in this particular case. Should I set it to, say, "-1" or something? What is a common approach to this?

And what about feature scaling in that case?

Best Answer

For categorical variables, code a new category of "missing"; for continuous variables, set missing values to any constant value $a$ & add an indicator variable for missingness. E.g. let the linear predictor be given by

$$\eta=\beta_0 + \beta_1 x_1 + \beta_2 q_1 + \ldots$$

When $x_1$ is not missing, set $q_1$ to $0$:

$$\eta=\beta_0 + \beta_1 x_1 + \ldots$$

When $x_1$ is missing, set $q_1$ to $1$ & $x_1$ to $a$:

$$\eta=\beta_0 + \beta_1 a + \beta_2 + \ldots$$

Note that whatever value you choose for $a$ affects only the interpretation of $\beta_2$, leaving the model's predictions unchanged.† The new predictor $(x_1, q_1)$ can be used in interactions, & $q_1$ alone; but not $x_1$ alone, as it contains the arbitrary $a$.
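
To make this concrete, here's a minimal sketch in Python with scikit-learn (the data & column choices are made-up illustrations): `SimpleImputer` with `strategy="constant"` fills missing values of $x_1$ with the constant $a$ (its `fill_value`) & `add_indicator=True` appends the 0/1 missingness column $q_1$.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data: one continuous predictor x1 (say, engine power) with
# missing entries marked as np.nan, and a price target y.
X = np.array([[150.0], [90.0], [np.nan], [120.0], [np.nan]])
y = np.array([12000.0, 7000.0, 9000.0, 10500.0, 8000.0])

# strategy="constant" fills missing x1 with a = fill_value;
# add_indicator=True appends the 0/1 column q1 flagging missingness.
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True),
    LinearRegression(),
)
model.fit(X, y)

# Predictions for a complete row and a missing one.
print(model.predict(np.array([[110.0], [np.nan]])))
```

Refitting with a different `fill_value` changes only the fitted coefficient on the indicator column; for an unpenalized fit the predictions come out the same, matching the note above.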

Unfortunately, even when data are missing completely at random this technique leads to biased estimates when the predictors are correlated (i.e. almost always, with observational data). See Jones (1996), "Indicator & stratification methods for missing explanatory variables in multiple linear regression", JASA, 91, 433. Imputation of missing values is preferable. Little & Rubin (2002), Statistical Analysis with Missing Data, is a good introduction to the problems arising with missing data & techniques for dealing with them.
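
As a sketch of the imputation route, scikit-learn's `IterativeImputer` (an experimental API, hence the enabling import) models each feature with missing values as a function of the other features; the numbers below are made up, & this shows single imputation only:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix of correlated predictors (engine volume, engine power);
# each column with missing values is regressed on the others in turn.
X = np.array([
    [1.6, 110.0],
    [2.0, np.nan],
    [np.nan, 90.0],
    [1.4, 75.0],
])
print(IterativeImputer(random_state=0).fit_transform(X))
```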

† Of course you need to be careful when using any technique that penalizes coefficients according to their magnitude.