Feature Engineering – Why Create New Features Based on Existing Features in Machine Learning

feature-engineering, machine-learning

I often see people create new features from existing features in a machine learning problem.
For example, here: https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/ the author derives the size of a person’s family as a new feature from the numbers of siblings and parents, which were existing features.

But what is the point of this? I don't understand why creating new, correlated features is useful. Isn't it the algorithm's job to do that on its own?

Best Answer

The simplest example used to illustrate this is the XOR problem (see image below). Imagine you are given data consisting of $x$ and $y$ coordinates and a binary class to predict. You could expect your machine learning algorithm to find the correct decision boundary by itself, but if you generate the additional feature $z = xy$, the problem becomes trivial: $z > 0$ gives you a nearly perfect decision criterion for classification, and you used just simple arithmetic!

[Figure: the XOR problem]
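A minimal sketch of this point, under my own assumed setup (continuous XOR-style data, logistic regression from scikit-learn; none of this code is from the original answer): a linear classifier on the raw $(x, y)$ features is no better than chance, while adding the engineered feature $z = xy$ makes the problem nearly linearly separable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))        # raw x and y coordinates
y = (X[:, 0] * X[:, 1] > 0).astype(int)       # XOR-style labels by quadrant

# Raw features: no linear boundary separates the classes (accuracy ~0.5).
raw_acc = LogisticRegression().fit(X, y).score(X, y)

# Engineered feature z = x*y: z > 0 is an almost perfect criterion.
Z = np.column_stack([X, X[:, 0] * X[:, 1]])
eng_acc = LogisticRegression().fit(Z, y).score(Z, y)

print(f"raw features:       accuracy = {raw_acc:.2f}")
print(f"with z = xy added:  accuracy = {eng_acc:.2f}")
```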

So while in many cases you could expect the algorithm to find the solution on its own, feature engineering lets you simplify the problem instead. Simple problems are easier and faster to solve and need less complicated algorithms. Simple algorithms are often more robust, their results are often more interpretable, and they are more scalable (less computational resources, less time to train, etc.) and more portable. You can find more examples and explanations in the wonderful talk by Vincent D. Warmerdam, given at the PyData conference in London.

Moreover, don't believe everything the machine learning marketers tell you. In most cases, algorithms won't "learn by themselves". You usually have limited time, resources, and computational power, and the data is usually limited in size and noisy; none of this helps.

Taking this to the extreme, you could provide your data as photos of handwritten notes of the experimental results and pass them to a complicated neural network. The network would first have to learn to recognize the data in the pictures, then learn to understand it, and only then make predictions. To do so, you would need a powerful computer, lots of time for training and tuning the model, and huge amounts of data, because you are using such a complicated network. Providing the data in a computer-readable format (as tables of numbers) simplifies the problem tremendously, since you no longer need the character recognition step. You can think of feature engineering as the next step, where you transform the data to create meaningful features so that your algorithm has even less to figure out on its own. To give an analogy, it is like wanting to read a book in a foreign language: you could learn the language first, or you could read it translated into a language you already understand.

In the Titanic data example, your algorithm would need to figure out that summing the family-member columns makes sense in order to get the "family size" feature (yes, I'm personifying it here). This is an obvious feature for a human, but it is not obvious if you see the data as just columns of numbers. If you don't know which columns are meaningful when considered together with other columns, the algorithm could in principle figure it out by trying every possible combination of columns. Sure, we have clever ways of doing this, but it is still much easier if the information is given to the algorithm right away.
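For concreteness, here is a minimal sketch of that feature, assuming a Titanic-style DataFrame with the usual SibSp (siblings/spouses) and Parch (parents/children) columns; the file name "titanic.csv" is just a placeholder, not something from the original post.

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # placeholder path to a Titanic-style CSV

# The engineered feature: family size is just the sum of two existing
# columns plus the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

print(df[["SibSp", "Parch", "FamilySize"]].head())
```

One line of arithmetic encodes domain knowledge that the algorithm would otherwise have to discover by searching over column combinations.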
