Solved – Mining patterns in continuous sequence

machine learningpattern recognitionsequence analysistraminer

I have data in form of $N$ sequences $s_j=(t_i, e_i)_{i\in\{1,\ldots,n_j\}}$ with $n_j$ data-points each, where $t_i$ is a time-stamp and $e_i$ is a categorial event, say $e_i\in\{A,B,C,D\}$. The $N$ sequences are independent.

I want to find short (i.e., $<<n_j$) patterns of events (say, "DAAB", i.e., I'm not interested in their relative timing) that occur multiple times both within and across the individual time-series. I am not an expert in sequential pattern mining but as far as I understand, those algorithms (GSP, SPADE,…?) require as input a list of subsequent event-lists rather than a continuous stream. I could, of course, try to split each sequence $s_j$ into shorter bits (based on temporal distance in terms of the time of occurence $t_i$) and use those algorithms (any hints on how to do that effectively are most welcome!). But I was wondering if anyone could point me to a method that can handle continuous streams?

I am also interested in methods that are robust against (or estimate directly) "mutations" and "deletions", i.e., that the algorithm could somehow pick up that "DAB", "DABB" and "DAAB" are alike, for example.

Best Answer

Although it is not intended for streaming data, it may worth to have a look at the TraMineR R package.

With TraMineR you can, among others, find the most frequent subsequences using different counting methods (presence/absence in the sequence, multiple occurrences in each sequences, ...) and time constraints, find the frequent subsequences of length at most k and/or at least h (see help(seqefsub)).

I illustrate below how to search for subsequences with at most 3 elements using the time-stamped actcal.tse data that ships with TraMineR (type help(actcal.tse) for details about the data):

library(TraMineR)
data(actcal.tse)
actcal.seqe <- seqecreate(id = actcal.tse$id, timestamp = actcal.tse$time, 
      event = actcal.tse$event)

fsub <- seqefsub(actcal.seqe, maxK = 3, pMinSupport = .01)
fsub

If you want only subsequences with at least 2 elements, you need the seqentransfunction of the TraMineRextras package:

library(TraMineRextras)
fsub <- seqentrans(fsub)
fsub[fsub$data$nevent > 1]

Regarding your second point, you could use the seqdss function of TraMineR to extract the sequences of distinct successive states (DSS), where each spell such as AAAA in the sequence is replaced by a single A. for example, your three examples DAB, DAAB and DABB all have the same DSS. You could then compute distances between sequences of DSS.