Solved – identifying events (patterns) that occur before an event of interest using sequence of events data

pattern recognition, python, r, sas

I'm interested in identifying events (patterns) that occur before an event of interest.

For example, a customer calls in to complain, or checks their balance online, and then closes the account (the event of interest). The data takes the form of a sequence of events, each with a timestamp.

I would like to know what methodologies (and software) people are using for such a task.

I'm open to exact patterns (sequences of events), frequency of events (e.g. the customer calls in multiple times to complain), and timing of events (e.g. 80% of occurrences of event A lead to the event of interest within X days). Basically, I'm open to any approach people are using to identify patterns leading up to an event.

So far I have found the cSPADE algorithm, available in the arulesSequences package in R. It seems able to identify sequential patterns and which items co-occur. However, I don't think one can set a target event and have it find the patterns that lead up to it.

I'm open to algorithms available in R, Python, or SAS.

Thanks so much!

Best Answer

I assume you have a large training set available.

This problem can be tackled with many different approaches, and there is usually a trade-off between how well you can interpret the model and how well it predicts. I did something similar recently: after building a complex non-linear classifier that worked pretty well, I was asked to identify the exact moments or events that should trigger an intervention with customers. That proved too difficult in practice, so I ended up remodelling with trees so that managers could understand what was happening in their data. More details follow.

I'll start with approaches you can use to interpret what's going on. You will need a script that extracts, from the full dataset, the sequences of length $n$ that lead up to the event you are interested in. If $n$ is small and you have enough data, you can measure the frequencies of all combinations of events that lead to the event of interest. If the situation is more complex and there are no clear winners, you could try visualising these series of events as a tree, in the form of a Markov chain. With some ad-hoc cleaning, such as removing edges with $p<0.1$, you could end up with something actionable.
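As a rough illustration of the interpretable route, here is a minimal Python sketch using only the standard library. The toy data, the event names, and the helper functions `precursor_counts` and `transition_probs` are all made up for this example. It also simplifies the tree idea to a first-order transition table, which is an assumption on my part and not necessarily how you would want to draw the final picture.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one time-ordered list of event labels per customer.
sequences = {
    "c1": ["login", "check_balance", "complain", "close_account"],
    "c2": ["login", "complain", "complain", "close_account"],
    "c3": ["login", "check_balance", "login", "check_balance"],
    "c4": ["check_balance", "complain", "close_account"],
}
TARGET = "close_account"  # the event of interest (assumed name)


def windows_before_target(events, target, n):
    """Yield the n events immediately preceding each occurrence of target."""
    for i, event in enumerate(events):
        if event == target and i >= n:
            yield tuple(events[i - n:i])


def precursor_counts(sequences, target, n):
    """Count how often each length-n window directly precedes the target."""
    counts = Counter()
    for events in sequences.values():
        counts.update(windows_before_target(events, target, n))
    return counts


def transition_probs(sequences, min_p=0.1):
    """Estimate first-order transition probabilities, dropping edges below min_p."""
    pair_counts = defaultdict(Counter)
    for events in sequences.values():
        for a, b in zip(events, events[1:]):
            pair_counts[a][b] += 1
    probs = {}
    for a, successors in pair_counts.items():
        total = sum(successors.values())
        probs[a] = {b: c / total for b, c in successors.items() if c / total >= min_p}
    return probs


counts = precursor_counts(sequences, TARGET, n=2)
probs = transition_probs(sequences, min_p=0.1)

# Most frequent windows first, then the pruned edges out of each event.
for window, count in counts.most_common():
    print(window, count)
for source, edges in probs.items():
    print(source, edges)
```

Raising `min_p` prunes more aggressively. On real data you would likely also compare each window's count before the target against its overall count, so that patterns that are simply common everywhere do not look like precursors.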

If interpreting the results is less important and the critical thing is to predict whether the event of interest is going to happen, then I would go for classic classification. The series of events that lead to the event of interest are your positive class. You then construct series of events that did not lead to the event of interest; these are the negative class. You could use a criterion like "if no event of interest happens after 50 normal events, then I consider the series a negative". Features could be the frequencies of events, the frequencies of pairs of events, and the times between events (speed). You can also treat a series of events as a series of words and use standard text-classification approaches. Once you have classes and features, pick a classifier and do some modelling.
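To make the labelling and feature steps concrete, here is a small Python sketch, again standard library only. The function names `label_and_cut` and `featurize`, the feature-key format, and the assumption that timestamps are numbers measured in days are all mine; the 50-event cutoff is just the example criterion from above.

```python
from collections import Counter

TARGET = "close_account"  # assumed name for the event of interest
MAX_EVENTS = 50           # example cutoff for declaring a series negative


def label_and_cut(events, target=TARGET, max_events=MAX_EVENTS):
    """Return (prefix, label), or None if the series is too short to decide.

    events is a time-ordered list of (timestamp_in_days, event_name) pairs.
    Positive: the target occurs after at most max_events normal events.
    Negative: max_events normal events pass without the target.
    """
    names = [name for _, name in events]
    window = names[:max_events + 1]
    if target in window:
        return events[:window.index(target)], 1
    if len(events) >= max_events:
        return events[:max_events], 0
    return None


def featurize(prefix):
    """Event counts, adjacent-pair counts, and mean gap between events."""
    names = [name for _, name in prefix]
    times = [t for t, _ in prefix]
    feats = Counter()
    for name in names:
        feats[f"count:{name}"] += 1
    for a, b in zip(names, names[1:]):
        feats[f"pair:{a}>{b}"] += 1
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    feats["mean_gap"] = sum(gaps) / len(gaps) if gaps else 0.0
    return dict(feats)


# Hypothetical customer history: (day, event).
history = [(0, "login"), (2, "complain"), (3, "complain"), (5, "close_account")]
prefix, label = label_and_cut(history)
features = featurize(prefix)
print(label, features)
```

The resulting feature dictionaries can then be turned into a matrix and passed to whatever classifier you prefer. The pair counts are essentially bigram counts, which is where the analogy with text classification comes from.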
