Feature Engineering – Machine Learning Feature Encoding Techniques

feature-engineering · machine-learning · missing-data

I'm new to Machine Learning.

I've just finished the Coursera course. 🙂

And for my first practical attempt I wanted to "analyse" a local used-car sales website in order to build a model that would "predict" the final sale price.

And I have a problem with "encoding" the car features:
Some of them are categorical (make, model; gearbox encoded as 1 – manual, 2 – automatic, 3 – semi-automatic; fuel encoded as 1 – petrol, 2 – diesel, 3 – electric; etc.),
some are continuous (engine volume, engine power, mileage, etc.).
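
For reference, here's a minimal sketch of how I'm representing the data at the moment (Python with pandas; the column names and values are just made-up examples):

```python
import pandas as pd

# A few toy listings; None marks a feature the seller left blank.
cars = pd.DataFrame({
    "gearbox": [1, 2, None, 3],   # 1 - manual, 2 - automatic, 3 - semi-automatic
    "fuel":    [1, 1, 2, None],   # 1 - petrol, 2 - diesel, 3 - electric
    "mileage": [120000.0, None, 45000.0, 80000.0],
    "price":   [4500.0, 7200.0, 9900.0, 6100.0],  # target to predict
})
print(cars)
```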

The issue is that some of these features might be absent, as it is not compulsory to fill them all in.

My main question is: should I use some special value for representing a missing feature?

I don't feel like using "0" (zero) would do any good: since "0 * x = 0", absolutely any "theta" would fit in this particular case. Should I set it to, say, "-1" or something? What is a common approach to this?

And what about feature scaling in that case?

Best Answer

For categorical variables, code a new category of "missing"; for continuous variables, set missing values to any constant value $a$ & add an indicator variable for missingness. E.g. let the linear predictor be given by

$$\eta=\beta_0 + \beta_1 x_1 + \beta_2 q_1 + \ldots$$

When $x_1$ is not missing, set $q_1$ to $0$:

$$\eta=\beta_0 + \beta_1 x_1 + \ldots$$

When $x_1$ is missing, set $q_1$ to $1$ & $x_1$ to $a$:

$$\eta=\beta_0 + \beta_1 a + \beta_2 + \ldots$$

Note that whatever value you choose for $a$ affects only the interpretation of $\beta_2$, leaving the model's predictions unchanged.† The new predictor $(x_1, q_1)$ can be used in interactions, & $q_1$ alone; but not $x_1$ alone, as it contains the arbitrary $a$.
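
To make this concrete, here's a minimal sketch in Python with scikit-learn (the data & column choices are made-up illustrations): `SimpleImputer` with `strategy="constant"` fills missing values of $x_1$ with the constant $a$ (its `fill_value`) & `add_indicator=True` appends the 0/1 missingness column $q_1$.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data: one continuous predictor x1 (say, engine power) with
# missing entries marked as np.nan, and a price target y.
X = np.array([[150.0], [90.0], [np.nan], [120.0], [np.nan]])
y = np.array([12000.0, 7000.0, 9000.0, 10500.0, 8000.0])

# strategy="constant" fills missing x1 with a = fill_value;
# add_indicator=True appends the 0/1 column q1 flagging missingness.
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True),
    LinearRegression(),
)
model.fit(X, y)

# Predictions for a complete row and a missing one.
print(model.predict(np.array([[110.0], [np.nan]])))
```

Refitting with a different `fill_value` changes only the fitted coefficient on the indicator column; for an unpenalized fit the predictions come out the same, matching the note above.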

Unfortunately, even when data are missing completely at random this technique leads to biased estimates when the predictors are correlated (i.e. almost always, with observational data). See Jones (1996), "Indicator & stratification methods for missing explanatory variables in multiple linear regression", JASA, 91, 433. Imputation of missing values is preferable. Little & Rubin (2002), Statistical Analysis with Missing Data, is a good introduction to the problems arising with missing data & techniques for dealing with them.
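
As a sketch of the imputation route, scikit-learn's `IterativeImputer` (an experimental API, hence the enabling import) models each feature with missing values as a function of the other features; the numbers below are made up, & this shows single imputation only:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix of correlated predictors (engine volume, engine power);
# each column with missing values is regressed on the others in turn.
X = np.array([
    [1.6, 110.0],
    [2.0, np.nan],
    [np.nan, 90.0],
    [1.4, 75.0],
])
print(IterativeImputer(random_state=0).fit_transform(X))
```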

† Of course you need to be careful when using any technique that penalizes coefficients according to their magnitude.