Solved – Sequential pattern mining on single sequence

machine learningsequence analysisunsupervised learning

Can someone give me a hint about a good approach to find a frequent patterns in a single sequence.

For example there is the single sequence

3 6 1 2 7 3 8 9 7 2 2 0 2 7 2 8 4 8 9 7 2 4 1 0 3 2 7 2 0 3 8 9 7 2 0

I am looking for a method that can detect frequent patterns in this ordered sequence:

3 6 1 [2 7] 3 [8 9 7 2] 2 0 [2 7] 2 8 4 [8 9 7 2] 4 1 0 3 [2 7] 2 0 3 [8 9 7 2] 0

Also other information would be interesting like:

  • What is the probability that 7 comes after 2
  • When each number has a timestamp assigned to it, what is the estimated time interval that 7 occurs after 2

The sequential pattern mining methods I found require multiple sequences, but I have one large sequence where I want to detect regularities.

Thanks for any help!

Best Answer

Calculate a histogram of N-grams and threshold at an appropriate level. In Python:

from scipy.stats import itemfreq
s = '36127389722027284897241032720389720'
N = 2 # bi-grams
grams = [s[i:i+N] for i in xrange(len(s)-N)]
print itemfreq(grams)

The N-gram calculation (lines three and four) are from this answer.

The example output is

[['02' '1']
 ['03' '2']
 ['10' '1']
 ['12' '1']
 ['20' '2']
 ['22' '1']
 ['24' '1']
 ['27' '3']
 ['28' '1']
 ['32' '1']
 ['36' '1']
 ['38' '2']
 ['41' '1']
 ['48' '1']
 ['61' '1']
 ['72' '5']
 ['73' '1']
 ['84' '1']
 ['89' '3']
 ['97' '3']]

So 72 is the most frequent two-digit subsequence in your example, occurring a total of five times. You can run the code for all $N$ you are interested about.

Related Question