Solved – identifying events (patterns) that occur before an event of interest using sequence of events data

pattern recognition, python, r, sas

I'm interested in identifying events (patterns) that occur before an event of interest.

For example, a customer calls in to complain, or checks their balance online, and then closes the account (the event of interest). The data is a sequence of time-stamped events.

I would like to know what methodologies (and software) people are using for such a task.

I'm open to exact patterns (sequence of events), frequency of events (calls in multiple times to complain), and timing of events (80% of event A leads directly to event of interest within X days). Basically, open to any approach people are using to identify patterns leading up to an event.

Thus far I have found the cSPADE algorithm, available in the arulesSequences package in R. It seems to be able to identify sequential patterns and which items co-occur. However, I don't think one can set a target event for it to find patterns leading up to.

I'm open to algorithms available in R, Python, or SAS.

Thanks so much!

Best Answer

I suppose you have some large training set available.

This problem can be tackled in many different ways, and there is usually a trade-off between how well you can interpret the findings/model and how good the predictions are. I did something similar recently: I had a complex non-linear classifier that worked pretty well, but I was then asked to identify the exact moments/events that should trigger an intervention with customers. That turned out to be too difficult in practice, so I ended up remodelling with trees so that managers could understand what was happening in their data. More details follow:

I'll start with approaches you can use to interpret what's going on. You will need a script that extracts from the full dataset the sequences of length $n$ that lead to the event you are interested in. If $n$ is small and you have enough data, you can measure the frequencies of all combinations of events that lead to the event of interest. If the situation is a bit more complex and there are no clear winners, you could try visualising these series of events as a tree, in the form of a Markov chain. With some ad-hoc cleaning, such as removing edges with $p<0.1$, you could end up with something actionable.
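
As a rough illustration of both ideas, here is a minimal Python sketch. It assumes each customer's history is already a time-ordered list of event labels; the toy data, the event names and the helper functions `preceding_windows` and `transition_probs` are made up for this example. It counts the length-$n$ windows that immediately precede the event of interest, and estimates first-order transition probabilities while dropping edges below a threshold.

```python
from collections import Counter, defaultdict

def preceding_windows(sequences, target, n):
    """Yield the n events immediately before each occurrence of `target`."""
    for seq in sequences:
        for i, event in enumerate(seq):
            if event == target and i >= n:
                yield tuple(seq[i - n:i])

def transition_probs(sequences, min_p=0.1):
    """Estimate first-order transition probabilities, dropping edges below min_p."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        kept = {b: c / total for b, c in nxt.items() if c / total >= min_p}
        if kept:
            probs[a] = kept
    return probs

# Hypothetical toy data: one list of event labels per customer, in time order.
sequences = [
    ["login", "complain", "check_balance", "close"],
    ["login", "check_balance", "close"],
    ["complain", "complain", "check_balance", "close"],
    ["login", "check_balance", "login"],
]

# How often does each pair of events directly precede "close"?
window_counts = Counter(preceding_windows(sequences, target="close", n=2))

# Transition graph with weak edges (p < 0.1) removed.
edges = transition_probs(sequences, min_p=0.1)
```

`window_counts.most_common()` then lists the most frequent lead-in patterns, and `edges` can be handed to any graph-drawing tool to produce the pruned tree.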

If interpreting the results is not so important, and the critical thing is to predict whether the event of interest is going to happen, then I would go for classic classification. You have series of events that lead to the event of interest (this is your positive class). You construct some series of events that didn't lead to the event of interest (that's the negative class). You could use a criterion like "if no event of interest happens after 50 normal events, then I consider the series a negative". The features could be the frequencies of events, the frequencies of pairs of events, and the times between events (speed). You can also treat the series of events as a series of words and use standard approaches for text classification. Once you have classes and features, pick a classifier and do some modelling.
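
A minimal sketch of the labelling and feature step, in plain Python. The function names, the feature-key format and the assumption that timestamps are numeric (e.g. days) are all choices made for this illustration. For positive series, only the events before the event of interest are used, so the label does not leak into the features.

```python
from collections import Counter

def extract_features(events, times):
    """Event counts, event-pair counts and mean gap between consecutive events."""
    features = Counter()
    for e in events:
        features[f"count:{e}"] += 1
    for a, b in zip(events, events[1:]):
        features[f"pair:{a}>{b}"] += 1
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    features["mean_gap"] = sum(gaps) / len(gaps) if gaps else 0.0
    return dict(features)

def make_example(events, times, target, min_negative_len=50):
    """Return (features, label), or None if the series cannot be labelled yet.

    Positive: the target occurs; features use only the events before it.
    Negative: at least `min_negative_len` events and no target (assumed rule).
    """
    if target in events:
        cut = events.index(target)
        return extract_features(events[:cut], times[:cut]), 1
    if len(events) >= min_negative_len:
        return extract_features(events, times), 0
    return None

# Hypothetical customer history with timestamps in days.
events = ["complain", "complain", "check_balance", "close"]
times = [0, 2, 3, 10]
features, label = make_example(events, times, target="close")
```

The resulting feature dictionaries can be vectorised and passed to whichever classifier you prefer; the pair counts play the same role as bigrams in text classification.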

Related Solutions

Solved – Mining patterns in continuous sequence

Although it is not intended for streaming data, it may be worth having a look at the TraMineR R package.

With TraMineR you can, among other things, find the most frequent subsequences using different counting methods (presence/absence in the sequence, multiple occurrences in each sequence, ...) and time constraints, and find the frequent subsequences of length at most k and/or at least h (see help(seqefsub)).

I illustrate below how to search for subsequences with at most 3 elements using the time-stamped actcal.tse data that ships with TraMineR (type help(actcal.tse) for details about the data):

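library(TraMineR)
data(actcal.tse)
## Build event sequences from the time-stamped (id, time, event) records
actcal.seqe <- seqecreate(id = actcal.tse$id, timestamp = actcal.tse$time, 
      event = actcal.tse$event)

## Frequent subsequences with at most 3 elements and at least 1% support
fsub <- seqefsub(actcal.seqe, maxK = 3, pMinSupport = .01)
fsub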

If you want only subsequences with at least 2 elements, you need the seqentrans function of the TraMineRextras package:

library(TraMineRextras)
fsub <- seqentrans(fsub)
fsub[fsub$data$nevent > 1]

Regarding your second point, you could use the seqdss function of TraMineR to extract the sequences of distinct successive states (DSS), where each spell such as AAAA in the sequence is replaced by a single A. For example, your three examples DAB, DAAB and DABB all have the same DSS. You could then compute distances between the DSS sequences.
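
To make the DSS idea concrete outside R, here is a small Python sketch. It only mirrors the concept and is not TraMineR's implementation; the function name `distinct_successive_states` is made up for this example.

```python
from itertools import groupby

def distinct_successive_states(sequence):
    """Collapse each run of identical consecutive states into a single state."""
    return [state for state, _ in groupby(sequence)]

# The three example sequences mentioned above.
examples = ["DAB", "DAAB", "DABB"]
dss = ["".join(distinct_successive_states(s)) for s in examples]
```

All three entries of `dss` come out as the same string, which is why distances computed on DSS ignore how long each spell lasted.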

Solved – Pattern identification in time series data

If you don't have a training dataset or a target outcome, and simply want to understand the patterns in the data, clustering algorithms might be your best bet. There are many clustering algorithms, each with its own pros and cons, but in general you feed the data in and get clusters based on a distance measure. Since it seems that you only have categorical data, you might need something like the k-modes algorithm, which gives a heuristic solution.
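
For intuition, here is a very small k-modes sketch in plain Python. It assumes every record is a fixed-length tuple of categorical values; the toy records and the deterministic initialisation (the first k distinct records) are choices made for this illustration only. In practice you would more likely use an existing implementation with better initialisation.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Most frequent value in each column of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def k_modes(records, k, max_iter=20):
    """Alternate between assigning records to the nearest mode and updating modes."""
    modes = list(dict.fromkeys(records))[:k]  # first k distinct records
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes

# Hypothetical records: (channel, action, outcome) per customer.
records = [
    ("call", "complain", "close"),
    ("web", "balance", "stay"),
    ("call", "complain", "stay"),
    ("web", "balance", "close"),
]
labels, modes = k_modes(records, k=2)
```

The result depends on the initial modes, so running from several starting points and comparing the total within-cluster Hamming distance is a common precaution.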
