Solved – Sequential pattern mining on single sequence

machine learningsequence analysisunsupervised learning

Can someone give me a hint about a good approach to find a frequent patterns in a single sequence.

For example there is the single sequence

3 6 1 2 7 3 8 9 7 2 2 0 2 7 2 8 4 8 9 7 2 4 1 0 3 2 7 2 0 3 8 9 7 2 0

I am looking for a method that can detect frequent patterns in this ordered sequence:

3 6 1 [2 7] 3 [8 9 7 2] 2 0 [2 7] 2 8 4 [8 9 7 2] 4 1 0 3 [2 7] 2 0 3 [8 9 7 2] 0

Also other information would be interesting like:

What is the probability that 7 comes after 2
When each number has a timestamp assigned to it, what is the estimated time interval that 7 occurs after 2

The sequential pattern mining methods I found require multiple sequences, but I have one large sequence where I want to detect regularities.

Thanks for any help!

Best Answer

Calculate a histogram of N-grams and threshold at an appropriate level. In Python:

from scipy.stats import itemfreq
s = '36127389722027284897241032720389720'
N = 2 # bi-grams
grams = [s[i:i+N] for i in xrange(len(s)-N)]
print itemfreq(grams)

The N-gram calculation (lines three and four) are from this answer.

The example output is

[['02' '1']
 ['03' '2']
 ['10' '1']
 ['12' '1']
 ['20' '2']
 ['22' '1']
 ['24' '1']
 ['27' '3']
 ['28' '1']
 ['32' '1']
 ['36' '1']
 ['38' '2']
 ['41' '1']
 ['48' '1']
 ['61' '1']
 ['72' '5']
 ['73' '1']
 ['84' '1']
 ['89' '3']
 ['97' '3']]

So 72 is the most frequent two-digit subsequence in your example, occurring a total of five times. You can run the code for all $N$ you are interested about.

Related Solutions

Solved – Mining patterns in continuous sequence

Although it is not intended for streaming data, it may worth to have a look at the TraMineR R package.

With TraMineR you can, among others, find the most frequent subsequences using different counting methods (presence/absence in the sequence, multiple occurrences in each sequences, ...) and time constraints, find the frequent subsequences of length at most k and/or at least h (see help(seqefsub)).

I illustrate below how to search for subsequences with at most 3 elements using the time-stamped actcal.tse data that ships with TraMineR (type help(actcal.tse) for details about the data):

library(TraMineR)
data(actcal.tse)
actcal.seqe <- seqecreate(id = actcal.tse$id, timestamp = actcal.tse$time, 
      event = actcal.tse$event)

fsub <- seqefsub(actcal.seqe, maxK = 3, pMinSupport = .01)
fsub

If you want only subsequences with at least 2 elements, you need the seqentransfunction of the TraMineRextras package:

library(TraMineRextras)
fsub <- seqentrans(fsub)
fsub[fsub$data$nevent > 1]

Regarding your second point, you could use the seqdss function of TraMineR to extract the sequences of distinct successive states (DSS), where each spell such as AAAA in the sequence is replaced by a single A. for example, your three examples DAB, DAAB and DABB all have the same DSS. You could then compute distances between sequences of DSS.

Solved – RNNs for Non-sequential Data

1) Talking about output, fraud detection is a subclass of anomaly detection, so classification is a correct, but not the most appropriate view. In this case we basically learn not just zeros and ones, but the whole distribution that lies under input data-points. It gets converted to a classification problem only after applying some threshold to the output probability in order to catch some rare event, so it could be used to either analyze anomaly or apply some supervised learning algorithm.

2) The representation of the output doesn't really matter so much here. Imagine classifier that takes a word in a text and outputs if it represents an animal:

hello => 0
kitty => 1
cartoon => 0
minny => 0

It could be represented as a RNN that takes the whole text as a stream:

hello kitty cartoon minny => 0 1 0 0

And in this case it could use precedence of word "hello" as a clue that "kitty" is going to be next with higher probability

3) Talking about input and RNNs, there is actually a correspondence between RNNs and CNNs (in image processing) as CNNs convolution/cross-correlation can be viewed as a particular case of RNN. This is actually a most popular way of applying (sometimes) RNNs to a non-sequential data

4) returning back to credit card case, if every input point is not actually i.i.d (identically and independently distributed) like in "hello kitty" case, then RNN could help by identifying this connection:

ok, attacker1.attack1, ok, ok, attacker1.attack2, ok, attacker2.attack

RNN could identify connection (correlation) between attacker1.attack1 and attacker1.attack2 (for instance credit card number brute-force) faster than regular network as those attacks, regardless that they could come from different IPs - would still be close in time and have something in common (let's say some digits could repeat or have some underlying generating function). Without RNN, feedforward network would have to use parameters from only one point

TLDR; RNN (especially Long-Term-Short-Term) allow you to specify a prior belief about relationships between events in "time", so you could benefit from it if those events are actually coupled (grouped) in some way even if they're not strictly sequential. Basically RNN provides a powerful way to apply linear/non-linear cross-"correlation" (much like CNN, but more general).

Best Answer

Related Solutions

Solved – Mining patterns in continuous sequence

Solved – RNNs for Non-sequential Data

Related Question