Although it is not intended for streaming data, it may be worth having a look at the TraMineR
R package.
With TraMineR you can, among other things, find the most frequent subsequences using different counting methods (presence/absence in the sequence, multiple occurrences in each sequence, ...) and time constraints, and find the frequent subsequences of length at most k and/or at least h (see help(seqefsub)).
I illustrate below how to search for subsequences with at most 3 elements using the time-stamped actcal.tse
data that ships with TraMineR
(type help(actcal.tse) for details about the data):
library(TraMineR)
data(actcal.tse)
actcal.seqe <- seqecreate(id = actcal.tse$id, timestamp = actcal.tse$time,
event = actcal.tse$event)
fsub <- seqefsub(actcal.seqe, maxK = 3, pMinSupport = .01)
fsub
If you want only subsequences with at least 2 elements, you need the seqentrans
function of the TraMineRextras
package:
library(TraMineRextras)
fsub <- seqentrans(fsub)
fsub[fsub$data$nevent > 1]
Regarding your second point, you could use the seqdss
function of TraMineR
to extract the sequences of distinct successive states (DSS), where each spell such as AAAA
in the sequence is replaced by a single A
. for example, your three examples DAB
, DAAB
and DABB
all have the same DSS. You could then compute distances between sequences of DSS.
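The DSS idea itself (seqdss is an R function) can be sketched in a few lines of Python: collapse every run of identical states into a single state, so all three example sequences map to the same DSS.

```python
from itertools import groupby

def dss(seq):
    """Collapse each spell (run of identical states) into a single state."""
    return "".join(state for state, _ in groupby(seq))

# DAB, DAAB and DABB all reduce to the same DSS
print(dss("DAB"), dss("DAAB"), dss("DABB"))
```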
1) Regarding the output: fraud detection is a subclass of anomaly detection, so classification is a valid but not the most appropriate view. Here we essentially learn not just zeros and ones but the whole distribution underlying the input data points. It becomes a classification problem only after applying some threshold to the output probability in order to catch a rare event, so the output can be used either to analyze anomalies or to feed a supervised learning algorithm.
2) The representation of the output doesn't really matter much here. Imagine a classifier that takes a word from a text and outputs whether it represents an animal:
hello => 0
kitty => 1
cartoon => 0
minny => 0
It could be represented as an RNN that takes the whole text as a stream:
hello kitty cartoon minny => 0 1 0 0
In this case the network could use the preceding word "hello" as a clue that "kitty" is likely to come next.
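This mechanism can be sketched with a toy recurrent cell (made-up sizes, random weights, hypothetical vocabulary encoding): the hidden state carries information from earlier words, so the score for "kitty" depends on what preceded it.

```python
import numpy as np

# Toy sketch, not a trained model: random weights, invented dimensions.
rng = np.random.default_rng(0)
vocab = {"hello": 0, "kitty": 1, "cartoon": 2, "minny": 3}
emb = rng.normal(size=(len(vocab), 4))   # word embeddings
Wx = rng.normal(size=(4, 8))             # input-to-hidden weights
Wh = rng.normal(size=(8, 8))             # hidden-to-hidden weights (the recurrence)
w_out = rng.normal(size=8)               # hidden-to-output weights

def run(words):
    """Read the words as a stream, returning P(animal) after each word."""
    h = np.zeros(8)
    outs = []
    for w in words:
        h = np.tanh(emb[vocab[w]] @ Wx + h @ Wh)   # state carries the past
        outs.append(1 / (1 + np.exp(-h @ w_out)))  # sigmoid score
    return outs

# The score for "kitty" differs depending on the preceding word,
# which a feedforward network looking at one word at a time cannot express.
score_after_hello = run(["hello", "kitty"])[-1]
score_after_cartoon = run(["cartoon", "kitty"])[-1]
```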
3) Regarding the input and RNNs, there is actually a correspondence between RNNs and CNNs (in image processing): the convolution/cross-correlation of a CNN can be viewed as a particular case of an RNN. This is actually the most popular way of applying RNNs to non-sequential data.
4) Returning to the credit card case: if the input points are not actually i.i.d. (independent and identically distributed), as in the "hello kitty" case, then an RNN could help by identifying that connection:
ok, attacker1.attack1, ok, ok, attacker1.attack2, ok, attacker2.attack
An RNN could identify the connection (correlation) between attacker1.attack1 and attacker1.attack2 (for instance, a brute-force attack on credit card numbers) faster than a regular network, because those attacks, even though they could come from different IPs, would still be close in time and have something in common (say, some digits repeat, or there is some underlying generating function). Without recurrence, a feedforward network would have to work with the parameters of only one point.
TL;DR: RNNs (especially LSTMs, Long Short-Term Memory networks) allow you to specify a prior belief about relationships between events in "time", so you can benefit from them if those events are actually coupled (grouped) in some way, even if they are not strictly sequential. Basically, an RNN provides a powerful way to apply linear/non-linear cross-"correlation" (much like a CNN, but more general).
Best Answer
Calculate a histogram of N-grams and threshold at an appropriate level. In Python:
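The original snippet did not survive here; the following is a minimal reconstruction of the described approach, with a made-up digit string and threshold:

```python
from collections import Counter

sequence = "7207217227237246"  # hypothetical digit string; the original data is not shown
N = 2                          # length of the subsequences (N-grams) to count
# slide a window of width N over the sequence to list every N-gram
grams = [sequence[i:i + N] for i in range(len(sequence) - N + 1)]
counts = Counter(grams)        # the histogram of N-grams
# threshold: keep only the N-grams occurring at least twice
frequent = {g: c for g, c in counts.items() if c >= 2}
print(counts.most_common(3))
```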
The N-gram calculation (lines three and four) is from this answer.
In the example output, 72 is the most frequent two-digit subsequence, occurring a total of five times. You can run the code for any $N$ you are interested in.