Solved – Mining patterns in continuous sequence

machine learning, pattern recognition, sequence analysis, traminer

I have data in the form of $N$ sequences $s_j=(t_i, e_i)_{i\in\{1,\ldots,n_j\}}$ with $n_j$ data points each, where $t_i$ is a time stamp and $e_i$ is a categorical event, say $e_i\in\{A,B,C,D\}$. The $N$ sequences are independent.

I want to find short (i.e., of length $\ll n_j$) patterns of events (say, "DAAB"; I'm not interested in their relative timing) that occur multiple times both within and across the individual sequences. I am not an expert in sequential pattern mining, but as far as I understand, those algorithms (GSP, SPADE,…?) require as input a list of separate event lists rather than a continuous stream. I could, of course, try to split each sequence $s_j$ into shorter bits (based on the temporal distance between the times of occurrence $t_i$) and use those algorithms (any hints on how to do that effectively are most welcome!). But I was wondering if anyone could point me to a method that can handle continuous streams?

I am also interested in methods that are robust against (or estimate directly) "mutations" and "deletions", i.e., that the algorithm could somehow pick up that "DAB", "DABB" and "DAAB" are alike, for example.

Best Answer

Although it is not intended for streaming data, it may be worth having a look at the TraMineR R package.

With TraMineR you can, among other things, find the most frequent subsequences using different counting methods (presence/absence in the sequence, multiple occurrences in each sequence, ...) and time constraints, and restrict the search to frequent subsequences of length at most k and/or at least h (see help(seqefsub)).

I illustrate below how to search for subsequences with at most 3 elements using the time-stamped actcal.tse data that ships with TraMineR (type help(actcal.tse) for details about the data):

library(TraMineR)
data(actcal.tse)
actcal.seqe <- seqecreate(id = actcal.tse$id, timestamp = actcal.tse$time, 
      event = actcal.tse$event)

fsub <- seqefsub(actcal.seqe, maxK = 3, pMinSupport = .01)
fsub

If you want only subsequences with at least 2 elements, you need the seqentrans function of the TraMineRextras package:

library(TraMineRextras)
fsub <- seqentrans(fsub)
fsub[fsub$data$nevent > 1]
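
Regarding the idea in the question of splitting each continuous stream into shorter bits based on temporal distance, here is a minimal base R sketch. The toy data, the column names and the threshold max_gap are all hypothetical; a sensible gap threshold depends on the data.

```r
# Hypothetical toy stream: time stamps (arbitrary units) and categorical events
stream <- data.frame(
  time  = c(1, 2, 3, 4, 20, 21, 22, 50, 51, 52, 53),
  event = c("D", "A", "A", "B", "D", "A", "B", "D", "A", "B", "B"),
  stringsAsFactors = FALSE
)

max_gap <- 5  # assumed threshold: a longer gap starts a new episode

# An episode starts at the first event and wherever the gap exceeds max_gap
new_episode <- c(TRUE, diff(stream$time) > max_gap)
stream$episode <- cumsum(new_episode)

# One string of events per episode
episodes <- tapply(stream$event, stream$episode, paste, collapse = "")
print(episodes)
```

The episode column could then presumably serve as the id passed to seqecreate, in the same way as actcal.tse$id above. With several independent streams, the id would have to combine the stream identifier and the episode number.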

Regarding your second point, you could use the seqdss function of TraMineR to extract the sequences of distinct successive states (DSS), where each spell such as AAAA in the sequence is replaced by a single A. For example, your three patterns DAB, DAAB and DABB all have the same DSS, namely DAB. You could then compute distances between the DSS sequences.
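
To make the DSS idea concrete without TraMineR, here is a small base R sketch. It does not use seqdss; the helper dss is a hypothetical stand-in that works on plain strings with one character per event, and adist (from the utils package) is used only as one possible edit distance.

```r
# Collapse each run of identical successive events into a single event
dss <- function(s) {
  events <- strsplit(s, "")[[1]]            # one character per event
  paste(rle(events)$values, collapse = "")  # keep one value per run
}

patterns <- c("DAB", "DAAB", "DABB")
collapsed <- vapply(patterns, dss, character(1), USE.NAMES = FALSE)
print(collapsed)

# Levenshtein (edit) distances between the original patterns
print(adist(patterns))
```

Note that collapsing spells only absorbs repeated events. A substitution such as DAB versus DCB leaves different DSS, which is where an edit-type distance comes in.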

Related Solutions

Solved – identifying events (patterns) that occur before an event of interest using sequence of events data

I suppose you have some large training set available.

This problem can be tackled with many different approaches, and there is usually a trade-off between how well you can interpret the findings/model and how good your predictions are. I did something similar recently: after building a complex non-linear classifier that worked pretty well, I was asked to identify the exact moments/events that signal when we should intervene with customers. That turned out to be practically too difficult, so I ended up remodelling with trees to let managers understand what's happening in their data. More details follow:

I start with approaches you can use to interpret what's going on: You will need to write a script that extracts from the full dataset the sequences of length $n$ that lead to the event you are interested in. If $n$ is low and you have enough data, you can measure the frequencies of all combinations of events that lead to the event of interest. If the situation is a bit more complex and there are no clear winners, you could try to visualise these series of events using a tree as a form of Markov chain. Using some ad-hoc cleaning, like removing edges with $p<0.1$, you could end up with something actionable.
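
As a rough illustration of that extraction step, here is a base R sketch. The event log, the target label "X" and the window length n are made up for the example; the transition table at the end is one simple reading of the "Markov chain with pruned edges" idea, not the only one.

```r
# Hypothetical event log; "X" plays the role of the event of interest
events <- c("A", "B", "C", "X", "A", "B", "C", "X", "B", "A", "C", "X", "A", "B")
target <- "X"
n <- 2  # number of preceding events to keep

# Positions of the target that have at least n events before them
hits <- which(events == target)
hits <- hits[hits > n]

# Paste the n events immediately preceding each occurrence of the target
preceding <- vapply(
  hits,
  function(i) paste(events[(i - n):(i - 1)], collapse = ""),
  character(1)
)

# Frequency of each preceding pattern, most frequent first
freq <- sort(table(preceding), decreasing = TRUE)
print(freq)

# First-order transition probabilities (rows sum to 1), then prune weak edges
trans <- prop.table(table(from = head(events, -1), to = tail(events, -1)), margin = 1)
trans[trans < 0.1] <- 0
print(round(trans, 2))
```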

If interpreting the results is not so important, and the critical thing is to predict whether the event of interest is going to happen, then I would go for classic classification. You have series of events that lead to the event of interest (this is your positive class). You construct some series of events that didn't lead to the event of interest (that's the negative class). You could use a criterion like "if no event of interest happens after 50 normal events, then I consider the series a negative". The features could then be the frequencies of events, the frequencies of pairs of events, and the times between events (speed). You can treat the series of events as series of words and use standard approaches for text classification. Once you have classes and features, pick a classifier and do some modelling.
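
The feature construction described above could look roughly like this in base R. The series, labels and alphabet are invented for illustration, and only event and pair frequencies are computed (no timing features); the resulting matrix can be handed to whatever classifier you prefer.

```r
# Hypothetical labelled series: 1 = led to the event of interest, 0 = did not
series <- list(c("A", "B", "C"), c("B", "C", "C"), c("A", "A", "B"), c("C", "A", "A"))
label <- c(1, 1, 0, 0)
alphabet <- c("A", "B", "C")

# All possible ordered pairs of events
pairs <- as.vector(outer(alphabet, alphabet, paste0))

# Counts of single events and of adjacent pairs for one series
featurise <- function(s) {
  unigrams <- table(factor(s, levels = alphabet))
  bigrams <- table(factor(paste0(head(s, -1), tail(s, -1)), levels = pairs))
  as.numeric(c(as.vector(unigrams), as.vector(bigrams)))
}

# One row per series, one column per feature
features <- t(vapply(series, featurise, numeric(length(alphabet) + length(pairs))))
colnames(features) <- c(alphabet, pairs)

training <- data.frame(label = label, features)
print(training)
```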

Solved – Comparing two sequence objects

This is possible as long as your sequence objects share the same alphabet.

You do it by merging the two sequence objects into a pooled object, and then using dissassoc with the indicator of the original set as group argument. To illustrate, I first create two separate objects of female and male from the mvad sequence object:

library(TraMineR)
data(mvad)
levels(mvad[,"male"]) <- c("female","male")

mvad.seq <- seqdef(mvad, 17:86)

male.seq <- mvad.seq[mvad$male=="male",]
female.seq <- mvad.seq[mvad$male=="female",]

and merge the two objects into a single one:

pooled.seq <- rbind(male.seq,female.seq)

We create an indicator of the originating set (which must follow the order of the sequences in pooled.seq):

oset.male <- rep(1,nrow(male.seq))
oset.female <- rep(2,nrow(female.seq))
oset <- c(oset.male, oset.female)

Now, we compute pairwise dissimilarities from the pooled object and get the discrepancy analysis by means of dissassoc:

lcs <- seqdist(pooled.seq, method="LCS")
dissassoc(lcs, group=oset)
