Feature Engineering – Choosing Between Lag-Based Numerical Features and ID Categorical Variables

feature-engineering, lags, machine learning, predictive-models, regression

I have to develop a machine learning regression model to predict customers' delays in paying invoices. In addition to the invoice-related variables, the customer is of course a very important variable.

One possibility is to use the customer directly as a categorical predictor variable. An alternative is to use customer-related lag-based features, such as the number of invoices the customer has previously paid and the average delay on them.

Of course I can try both methods and compare their accuracy (ID categorical variable vs. lag-based numerical features), but what are the implications, pros, and cons of the two approaches?

I did a Google search and was surprised that I could not find any resources on this topic.

Best Answer

If you have a large number of customers and a linear model is an option, you could use a random effects term for customer. Random effects work well for categorical variables with a large number of levels. Without them, the estimated customer effects can differ hugely and unrealistically from one customer to another, especially for customers with little data.
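To illustrate why this helps, here is a minimal sketch of the shrinkage that a random intercept produces, written in plain Python. It is not a fitted mixed model: the function name, the dict keys `customer` and `delay`, and the constant `k` are all assumptions made for the example. `k` stands in for the ratio of residual variance to between-customer variance, which a real mixed model would estimate from the data.

```python
from collections import defaultdict

def shrunken_customer_means(invoices, k=5.0):
    """Pull each customer's mean delay toward the global mean.

    `k` is a hypothetical constant: the larger it is, the more a customer
    with few invoices is pulled toward the overall average.
    """
    by_customer = defaultdict(list)
    for inv in invoices:
        by_customer[inv["customer"]].append(inv["delay"])

    all_delays = [d for delays in by_customer.values() for d in delays]
    global_mean = sum(all_delays) / len(all_delays)

    # Weighted average of the customer's own data and k "pseudo-invoices"
    # that each sit at the global mean.
    return {
        customer: (sum(delays) + k * global_mean) / (len(delays) + k)
        for customer, delays in by_customer.items()
    }

# Customer A has 20 invoices, customer B has a single, very late one.
example_invoices = [{"customer": "A", "delay": 2} for _ in range(20)]
example_invoices.append({"customer": "B", "delay": 44})
means = shrunken_customer_means(example_invoices)
```

With these made-up numbers, customer A stays close to its own average because it has plenty of data, while customer B's single 44-day delay is pulled strongly toward the global mean instead of being taken at face value.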

You can also include lag terms for the customer's previous payment delays. You may have to write some code to create these features, for example one column for the delay on the customer's previous order, another for the delay on the order before that, and so on.
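A minimal sketch of how such lag columns might be built, using only the Python standard library. The record layout (dicts with `customer`, `date`, and `delay` keys) and the function name are assumptions for illustration. The sketch also assumes that the delay on an earlier invoice is already known when the later invoice is scored; if that does not hold in your data, the lags would leak future information.

```python
from collections import defaultdict

def add_lag_features(invoices, n_lags=2):
    """Return copies of the invoice dicts with lag_1..lag_n delay columns."""
    history = defaultdict(list)  # customer -> past delays, oldest first
    out = []
    # ISO date strings sort chronologically, so a plain sort is enough here.
    for inv in sorted(invoices, key=lambda r: (r["customer"], r["date"])):
        past = history[inv["customer"]]
        row = dict(inv)
        for lag in range(1, n_lags + 1):
            # lag_k is the delay k invoices back, or None if there is none yet.
            row[f"lag_{lag}"] = past[-lag] if len(past) >= lag else None
        out.append(row)
        # Record the current delay only after the row's features are built.
        past.append(inv["delay"])
    return out

invoices = [
    {"customer": "A", "date": "2023-01-05", "delay": 3},
    {"customer": "B", "date": "2023-01-07", "delay": 10},
    {"customer": "A", "date": "2023-02-05", "delay": 0},
    {"customer": "A", "date": "2023-03-05", "delay": 7},
]
lagged = add_lag_features(invoices)
```

The first invoices of each customer get `None` for the missing lags, so you still need to decide how to impute or flag them before fitting a model.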

An interaction between the customer variable and the lagged variables allows the payment history to contribute differently for each customer. However, with many customers and many lag terms, the number of features explodes, since each customer gets its own coefficient for each lag. I would start small and try only two lags.

An alternative to including multiple lags is to calculate, for each order, the exponential moving average of the same customer's previous payment delays, for example with the R function TTR::EMA. You will have to choose a value for the smoothing constant, but this approach has the advantage of fewer features and a smooth contribution from a larger set of previous orders.
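The same idea can be sketched in plain Python. This is not a reproduction of TTR::EMA (it seeds the average with the customer's first observed delay, which may differ from how that function initialises); the function name, the record keys, and the default `alpha` are assumptions for illustration. Each invoice receives the average as it stood before that invoice, so its own delay does not leak into its feature.

```python
def add_ema_feature(invoices, alpha=0.3):
    """Add 'ema_prev_delay': EMA of the customer's delays before this invoice.

    `alpha` is the smoothing constant (0 < alpha <= 1); larger values
    weight recent invoices more heavily.
    """
    state = {}  # customer -> EMA over all invoices processed so far
    out = []
    for inv in sorted(invoices, key=lambda r: (r["customer"], r["date"])):
        customer = inv["customer"]
        prev = state.get(customer)  # None for the customer's first invoice
        row = dict(inv)
        row["ema_prev_delay"] = prev
        out.append(row)
        # Update the running average only after the feature has been stored.
        if prev is None:
            state[customer] = float(inv["delay"])
        else:
            state[customer] = alpha * inv["delay"] + (1 - alpha) * prev
    return out

history = [
    {"customer": "A", "date": "2023-01-05", "delay": 4},
    {"customer": "A", "date": "2023-02-05", "delay": 8},
    {"customer": "A", "date": "2023-03-05", "delay": 0},
]
with_ema = add_ema_feature(history, alpha=0.5)
```

Here the third invoice gets an `ema_prev_delay` halfway between the first two delays, because `alpha=0.5` weights the latest delay and the running average equally.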
