Solved – Pattern identification in time series data

pattern recognitionsequence analysistime series

Imagine a set of users who always perform the same set of tasks and in the same order. Say – user A – enters the store, reads the newspaper, looks at the TV, orders coffee, eats a donut and leaves. User B, enters the store, orders coffee, reads a magazine, leaves and eats a donut.

However, what we see are only user actions/activity and user identity isn't captured. So we see events like – someone entered the store, read the paper, ordered coffee, ordered coffee, reads a magazine, eats a donut, leaves, leaves, eats a donut.

Given the activity log, is it possible to re-thread actions to identify recurring patterns that would give us users? Obviously, we see the same "user" over and over, billions of times in the data set so the pattern repeats but is interspersed with activity of other users.

What I want to predict is – given an event at time $t_i$, does the event fit a known user's pattern/order of events or is this part of a new pattern we have not seen before.

I bit more formally, if our inputs are 1……n, the set of possible
events generated A….Z (an actual sequence of events generated by an
input would be a permutation with repetition), and X~i~ denotes an
event where X is an event for the ith input, then the sequence of
events generated by say,

input 1 is:

A1, B1, E1, X1, Y 1

input 2:
C2, D2, A2, X2

input 3:


Z3, D3, N3, E3, Y 3

If all inputs – 1, 2 and 3 are triggered simultaneously or in close
succession, the events could occur in this order:


A1, B1, C2, D2, E1, Z3, D3, N3, X1, Y1, A2, X2, E 3, Y3

But what we observe as event log:

A, B, C, D, E, Z, D, N, X, Y, A, X, E, Y

The idea is to take the observed events from the log, and identify what
input or sequence of events does a given event belong to.

While I am not a mathematician or statistician, I have tried to research
or apply the following algorithms but failed to find a match.

Fourier transform : I tried to transform the events into some sort
of signal that can then be decomposed into individual components/inputs.
However, while each input does repeat but interval between successive
repetitions isn’t fixed. Also, the time delta between successive events
in the same sequence isn’t fixed either – we just know the order is
fixed.

Motif discovery : My understanding is that for motif discovery –
each sequence of events appears in order and together in a dataset so
events from different sequences are not interspersed. Also, looks like
sequence of events are the same size which is not the same with my use
case.

Hidden Markov Model : First, the observations have to be equally
spaced along time. That isn’t the case with my dataset but I can flatten
the time stamps and assume all events are at equal and repeated time
intervals. However, the same hidden process does not generate our
observations and events have dependencies – that is – for say, input 1,
B1 will only happen after A1.

Frequent Pattern Mining : Intuitively, I understand these are more
for identifying events that occur/appear together, at a given time.
Also, order isn’t important for these patterns because, for example, if
milk and eggs are always bought together then it doesn’t matter which
one is bought first, the other will follow.

Few others like Association rules or Random forests – however these seem
to rely on events leading up to a result of some sort – which again
doesn’t match my use case since I do not have any results/marker
generated by a sequence of events.

Best Answer

If you don't have training dataset and don't have a result set that you want to get but rather understand a pattern, clustering algorithms might be your best bet. There are many clustering algorithms with pros and cons of each but you would be able to feed the data in and you'd get a clusters based on distance. Since it seems that you only have categorical data, you might need to use something like k-modes algorithm which would give you a heuristic solution.

Related Question