Solved – Pre-processing time series data for data mining / predictive modeling input

data-mining · predictive-models · time-series

What are some ways to prepare/pre-process time series data for use as predictors in a predictive model (classification or regression)? Specifically, which methods should be considered in order to

  • Get the most predictively useful signal from the data?
  • Reduce the dimensionality of the series

As a concrete example, I have 90 days of ending balance data (amount on deposit in a checking account). I want to use that data to predict if the owner of the account will close it in the next 2 weeks (I have an indicator for this occurring or not 2 weeks after the end of the series).

ADDITION:

After reviewing the responses, I think I was looking for a list of commonly used techniques.

  • Certainly there is the feature creation that Matt Krause wrote about (each customer's balance series is treated separately in all these methods):
    Things like differences and % changes in the series values each day or week; sliding-window aggregations like weekly averages, minima, maxima, and standard deviations; and counts of increases and decreases, or indicators for balance changes (absolute or relative) of a certain size.
  • I have considered fitting a linear or polynomial regression to each series and using the coefficients as predictors in the model.
  • Another thing I've wondered about: for each balance series, compute the autocorrelations up to a maximum number of lags and use these values as predictors.
  • Cluster the time series into a relatively small number of groups and use indicators for cluster membership as predictors (using dynamic time warping distance and hierarchical clustering, for example).
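To make the first item above concrete, here is a minimal sketch of that kind of feature creation in pandas; the series and thresholds are hypothetical, and the handful of features shown is illustrative rather than exhaustive:

```python
import numpy as np
import pandas as pd

# Hypothetical 90-day ending-balance series for one account (random walk).
rng = np.random.default_rng(0)
balance = pd.Series(1000 + rng.normal(0, 50, 90).cumsum())

features = {
    # Day-over-day differences and % changes, summarized.
    "mean_daily_diff": balance.diff().mean(),
    "mean_daily_pct_change": balance.pct_change().mean(),
    # Sliding-window (weekly) aggregations.
    "last_week_mean": balance.rolling(7).mean().iloc[-1],
    "min_weekly_mean": balance.rolling(7).mean().min(),
    "max_weekly_mean": balance.rolling(7).mean().max(),
    "last_week_std": balance.rolling(7).std().iloc[-1],
    # Counts of increases/decreases, and an indicator for a large drop
    # (the -100 threshold is an arbitrary example).
    "n_increases": (balance.diff() > 0).sum(),
    "n_decreases": (balance.diff() < 0).sum(),
    "any_large_drop": int((balance.diff() < -100).any()),
}
```

Each account's series collapses to one such feature row, which addresses the dimensionality-reduction goal directly: 90 values become a small fixed-length vector.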

Are there others?

  • For example, do Fourier transforms work here? I may post a separate question about them.
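One common way to use a Fourier transform for this purpose (sketched below with a hypothetical series) is to detrend the series and keep the magnitudes of only the first few DFT coefficients as low-dimensional features; whether those frequencies carry useful signal for account closure is an open question:

```python
import numpy as np

# Hypothetical 90-day balance series (random walk).
rng = np.random.default_rng(1)
balance = 1000 + rng.normal(0, 50, 90).cumsum()

# Subtract the mean so the DC component doesn't dominate, then keep
# the magnitudes of the 5 lowest frequencies as features.
spectrum = np.fft.rfft(balance - balance.mean())
fourier_features = np.abs(spectrum[:5])
```

This turns each 90-point series into 5 numbers describing its coarse shape; phase information is discarded, which may or may not be acceptable for the task.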

Best Answer

I agree with what user765195 said: there's no magic bullet here that will work for all your problems. You've got to come up with potentially useful features based on your domain knowledge. I've never worked in a bank, so take these suggestions with a grain of salt, but how about:

  • "Volatility" When I've changed banks, I tend to use the new account for a while before I close the old one, since it takes a while for payroll/recurring charges to get moved over. Maybe the variance of the balance (or changes in the variance over shorter time windows) would capture this?

  • Transaction Size: Taking the derivative of the daily balances would give you an idea of the (net) daily transactions. Maybe people make anomalously large withdrawals before closing out their accounts (e.g., to set up a new account elsewhere).
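Both suggestions above can be sketched as features in a few lines; the series, the 14-day window, and the z-score formulation are all illustrative choices, not anything prescribed by the answer:

```python
import numpy as np
import pandas as pd

# Hypothetical 90-day balance series (random walk).
rng = np.random.default_rng(2)
balance = pd.Series(1500 + rng.normal(0, 30, 90).cumsum())

daily_txn = balance.diff()          # "derivative": net daily transactions
vol_14d = balance.rolling(14).std() # short-window volatility

answer_features = {
    # Overall volatility of the balance.
    "overall_variance": balance.var(),
    # Change in volatility: last 14-day window vs. the one before it.
    "vol_change": vol_14d.iloc[-1] - vol_14d.iloc[-15],
    # Anomalously large withdrawal: biggest single-day drop, expressed
    # relative to the typical transaction size.
    "max_withdrawal_z": (-daily_txn.min() - daily_txn.abs().mean())
                        / daily_txn.abs().std(),
}
```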

If I were you, I would start by making a long list of possible features. Carve your data up into a test set, a development set, and a training set. Test out new features on the training+development sets and see what works. Personally, I would throw them all in and see what happens first. There are lots of feature selection algorithms, ranging from the brain-dead but exhaustive (try all possible combinations!) to something like projection pursuit or hill-climbing, which might be more tractable for big data sets.
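The split-then-select workflow described above might look like the following sketch, using synthetic data and greedy forward (hill-climbing) selection scored on the development set; the data, model, and split proportions are all placeholder choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 600 accounts x 8 candidate features, with the
# label driven by features 0 and 3 plus noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + X[:, 3] + rng.normal(0, 0.5, 600) > 0).astype(int)

# Three-way split: training, development, and untouched test sets.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)

# Greedy forward selection: repeatedly add whichever remaining feature
# most improves accuracy on the development set; stop when none helps.
selected, best_score = [], 0.0
improved = True
while improved:
    improved = False
    for j in range(X.shape[1]):
        if j in selected:
            continue
        cols = selected + [j]
        model = LogisticRegression().fit(X_train[:, cols], y_train)
        score = model.score(X_dev[:, cols], y_dev)
        if score > best_score:
            best_score, best_j, improved = score, j, True
    if improved:
        selected.append(best_j)
```

Only after `selected` is frozen would `X_test`/`y_test` be used, once, to report performance, per the next paragraph.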

Then, once you've settled on a model, use the previously-untouched test data to evaluate its performance.