I have data in form of $N$ sequences $s_j=(t_i, e_i)_{i\in\{1,\ldots,n_j\}}$ with $n_j$ data-points each, where $t_i$ is a time-stamp and $e_i$ is a categorial event, say $e_i\in\{A,B,C,D\}$. The $N$ sequences are independent.
I want to find short (i.e., $<<n_j$) patterns of events (say, "DAAB", i.e., I'm not interested in their relative timing) that occur multiple times both within and across the individual time-series. I am not an expert in sequential pattern mining but as far as I understand, those algorithms (GSP, SPADE,…?) require as input a list of subsequent event-lists rather than a continuous stream. I could, of course, try to split each sequence $s_j$ into shorter bits (based on temporal distance in terms of the time of occurence $t_i$) and use those algorithms (any hints on how to do that effectively are most welcome!). But I was wondering if anyone could point me to a method that can handle continuous streams?
I am also interested in methods that are robust against (or estimate directly) "mutations" and "deletions", i.e., that the algorithm could somehow pick up that "DAB", "DABB" and "DAAB" are alike, for example.
Best Answer
Although it is not intended for streaming data, it may worth to have a look at the
TraMineR
R package.With TraMineR you can, among others, find the most frequent subsequences using different counting methods (presence/absence in the sequence, multiple occurrences in each sequences, ...) and time constraints, find the frequent subsequences of length at most k and/or at least h (see help(seqefsub)).
I illustrate below how to search for subsequences with at most 3 elements using the time-stamped
actcal.tse
data that ships withTraMineR
(type help(actcal.tse) for details about the data):If you want only subsequences with at least 2 elements, you need the
seqentrans
function of theTraMineRextras
package:Regarding your second point, you could use the
seqdss
function ofTraMineR
to extract the sequences of distinct successive states (DSS), where each spell such asAAAA
in the sequence is replaced by a singleA
. for example, your three examplesDAB
,DAAB
andDABB
all have the same DSS. You could then compute distances between sequences of DSS.