Feature Engineering – How to Generate Useful Features from Useless Ones in Machine Learning

Tags: classification, feature-engineering, interaction, machine learning, neural networks

I am new to ML and trying to work on a binary classification problem.

I have learned that feature engineering is one of the main factors in the success of an ML project. I am looking for inspiration and help from experienced folks on how to generate creative, interesting features from existing ones. My dataset is shown below.

[Image: screenshot of the dataset]

In my dataset, I have already created some new features from existing date columns (the original date columns are not shown in the screenshot; only the derived columns are). They are:

date_diff_1 – the time, in days, taken by the company to respond to the customer.

date_diff_2 – the time taken by the customer to book their first purchase order (PO) after the deal is made.

date_diff_3 – the time taken by the customer to reach $1,000 in revenue.

Additionally, the Division and category-Division variables are correlated, so I plan to remove the high-cardinality column category-Division.

How does one approach feature engineering in scenarios like this? What are some of the typical practices that data scientists follow to create interesting features that uncover hidden information in the data?

Best Answer

Arthur's answer is quite valuable here, but I think some more detail can be added.

Let's keep in mind that feature engineering is not the same as data collection. You should absolutely collect more data and more variables/information if you can. However, at the stage of feature engineering, we're assuming you already have all the raw data you could get.

During feature engineering, your job is to select, transform and combine these raw variables in ways that are helpful to the model.

The process that Arthur outlines is absolutely correct: to engineer good features, you need some subject matter input (someone who understands exactly what each variable means, how it was gathered, etc.). That way, you can combine this expertise with some creativity to find new ways of representing the data.

Let's move away from these somewhat abstract definitions, and let me give you some potential ideas for features.

The most fundamental thing to keep in mind: there is a huge range of possibilities here. As long as you're not leaking data, you can try literally whatever you can think of. There are no boundaries. If you have an idea that you think makes sense in the context of the problem, try it!

Some ideas I had, given your dataset (adjust these to your understanding of the data, as I might not fully know your situation):

  1. Average booking QTY of the customer (up until a point in time. Remember: no leaking data from the future. Don't average across the whole dataset, only over the data you would have had at that point.)
  2. Average order size or item price for the customer (again, only up until a point in time, using the data you would have had at that point. As I hope you realize, this caveat applies to ALL the features I mention, and to all features you can think of.)
  3. Number of orders this customer has made with you
  4. With your date_diff variables, you could again take averages (such as the average time to respond to this customer, rather than just the most recent response time)
  5. You could take standard deviations of these date_diff variables (to see how consistent they are over time)
  6. If you have the data for it, you could look at the average of actual revenue realized vs. your "revenue expected" column ("On average, the realized revenue is x% of expected revenue for this customer/category")
  7. You could take any of the above variables, calculate them only for the customers within a certain category, and then compare your given customer to all other customers in that category. For example, look at the number of orders customer X, in category Z, has made with you COMPARED to the average number of orders ALL customers in category Z have made with you. You could express this number as a percentile.
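To make a few of these ideas concrete, here is a minimal pandas sketch of items 1, 3, 4, 5 and 7. It assumes a hypothetical order-history table with one row per order; the column names `customer_id`, `category`, `order_date` and `booking_qty`, the sample values, and the cutoff date are all invented for illustration and will need adapting to your data. The key trick is `shift(1)` before `expanding()`, so each row only sees that customer's earlier orders and never its own or future values.

```python
import pandas as pd

# Hypothetical order history (one row per order). All names and values
# here are assumptions for illustration, not taken from the real dataset.
orders = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "B", "C"],
    "category":    ["Z", "Z", "Z", "Z", "Z", "Z"],
    "order_date":  pd.to_datetime([
        "2021-01-05", "2021-02-10", "2021-03-15",
        "2021-01-20", "2021-04-01", "2021-02-01",
    ]),
    "booking_qty": [10, 20, 30, 5, 15, 8],
    "date_diff_1": [2, 4, 6, 1, 3, 5],  # response time in days
})

# Sort chronologically within each customer so "earlier rows" means "the past".
orders = orders.sort_values(["customer_id", "order_date"]).reset_index(drop=True)
by_customer = orders.groupby("customer_id")


def past_mean(s):
    # shift(1) excludes the current order from its own average
    return s.shift(1).expanding().mean()


def past_std(s):
    return s.shift(1).expanding().std()


# Item 3: number of orders the customer had placed BEFORE this one.
orders["prior_orders"] = by_customer.cumcount()

# Item 1: average booking quantity over the customer's previous orders.
orders["avg_qty_prior"] = by_customer["booking_qty"].transform(past_mean)

# Items 4 and 5: mean and spread of past response times.
orders["avg_resp_prior"] = by_customer["date_diff_1"].transform(past_mean)
orders["std_resp_prior"] = by_customer["date_diff_1"].transform(past_std)

# Item 7: as of a cutoff date, compare each customer's order count with
# the other customers in the same category, expressed as a percentile rank.
cutoff = pd.Timestamp("2021-03-01")
history = orders[orders["order_date"] < cutoff]
counts = (
    history.groupby(["category", "customer_id"])
    .size()
    .reset_index(name="n_orders")
)
counts["n_orders_pct_in_category"] = (
    counts.groupby("category")["n_orders"].rank(pct=True)
)
```

The first order of every customer gets missing values for the averages, because there is no history yet; how to fill those (a category-level average, a sentinel value, or leaving them for a tree-based model to handle) is a modelling choice. For item 7, the cutoff should be no later than the moment at which you would actually be making the prediction.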

I think this gives some pointers and inspiration!
