Solved – Similarity / clustering methods for temporal event data

bayesian, machine learning, neural networks, similarities, unsupervised learning

I am looking for methods to calculate a reasonable similarity metric or clustering for my data. I have data in the form of a number of distinct events (let's call them A, B, C, D, …) and the time between the events. A few example data points could look like:

X1 = [(A, 0), (B, 2.2), (C, 3.2), (A, 4.0)]
X2 = [(B, 0), (C, 2.2)]
X3 = [(B, 0), (C, 2.2), (D, 4.2)]
X4 = [(B, 0), (A, 2.2), (E, 1.5)]
X5 = [(B, 0), (F, 2.2), (E, 1.5), (F, 5.0), (G, 2.0), (H, 2.0)]
X6 = [(B, 0), (A, 0.0), (E, 1.5), (F, 5.0), (G, 2.0)]
X7 = [(I, 0), (J, 0.0), (K, 1.5), (L, 5.0), (M, 2.0)]

To explain the format: each entry in the array is of the form (event, time since the first event). E.g. for X1, at time=0, event A happened. At time=3.2 (after event A), event C happened, and so forth. Also note that multiple events may happen at the same time.
In reality, the sequences are typically much longer (50-100ish). The "vocabulary" of events is 1500ish. The time between two events ranges from 0.0 to 100.

I have a good few million data points, give or take, so I have been trying to look into a clustering-based approach. From the above example, I would expect X1, X2, and X3 to have some similarity (X4 maybe, since the pattern A->B is just reversed). Also, X5 and X6 should be fairly similar. X7 should not have any similarity to any of the patterns. This is a somewhat fuzzy explanation, and that is intended. I am not looking for the one method that clusters perfectly; I am just looking for one that does an OK job of taking all the variables into consideration.

I have been looking into things like Latent Dirichlet Allocation, Self Organising Maps, unsupervised Recurrent Neural Networks, Autoencoders, just counting the similarity between triplets, …

My major problems in finding something reasonable have been:

  1. The method expects the same number of features for every data point
  2. The method sees the sequence as a bag-of-words
  3. The method does not implicitly handle a mixture of categorical and continuous data
  4. The method has no notion of time

I put some very broad categories on this question, simply because I am looking for suggestions or pointers to software/papers.

Any help would be much appreciated, and please do comment if something is not clear.

Edit 1: I started looking into Recurrent Neural Networks, but the information is quite dense, and the unsupervised part does not seem to be that well developed. The best paper I could find was Klapper-Rybicka et al. (2001), but they apply information-theoretic approaches tailored specifically to continuous acoustic data (Binary Information Gain Optimization and Nonparametric Entropy Optimization), and I am not quite sure how to translate these measures to the type of data that I have described above.

If anyone could elaborate on the unsupervised training of recurrent neural networks using data as I have described above, I would be very thankful.

Best Answer

Step 1: Filter

For each event $X$, define a filter $F_X$ on data points which keeps only the occurrences of $X$ and sorts their timestamps. The output of this filter is then a vector of sorted non-negative reals.

Example: on the data point

[(A,0), (B,1), (C,1), (A,2.2), (B,2.2), (A,2.5), (C,2.7), (A,3.3)]

the filter $F_A$ would yield

[0, 2.2, 2.5, 3.3]

and the filter $F_B$ would yield

[1, 2.2]
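
A minimal Python sketch of this filter, using the data point from the example (the helper name filter_event is mine):

def filter_event(data_point, event):
    # Keep only the occurrences of `event` and return their sorted timestamps.
    return sorted(t for e, t in data_point if e == event)

x = [("A", 0), ("B", 1), ("C", 1), ("A", 2.2), ("B", 2.2), ("A", 2.5), ("C", 2.7), ("A", 3.3)]
filter_event(x, "A")  # [0, 2.2, 2.5, 3.3]
filter_event(x, "B")  # [1, 2.2]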

Step 2: Window

Next, select a positive real "window size" and partition the time axis into a sequence of right half-open intervals of this size. For example, if your size were 1.0, your windows would be the half-open intervals:

[0,1.0), [1.0,2.0), [2.0,3.0), ...

Now, from the output of $F_X$, group the elements which occurred in the same window. So if your window size is 1.0, then the output of $F_A$ from above

[0, 2.2, 2.5, 3.3]

would be grouped as

([0], [], [2.2,2.5], [3.3])
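
A sketch of this grouping, reusing the output of the filter step (the helper name window_groups is mine):

import math

def window_groups(timestamps, window_size):
    # Partition sorted timestamps into consecutive half-open windows [k*w, (k+1)*w).
    if not timestamps:
        return []
    n_windows = math.floor(max(timestamps) / window_size) + 1
    groups = [[] for _ in range(n_windows)]
    for t in timestamps:
        groups[int(t // window_size)].append(t)
    return groups

window_groups([0, 2.2, 2.5, 3.3], 1.0)  # [[0], [], [2.2, 2.5], [3.3]]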

Step 3: Aggregate

Now perform a COUNT over these groups; continuing the example we obtain

(1, 0, 2, 1)

Let's call this the event signature of the data point with respect to the event $X$ and the chosen window size $w$, and denote it by $E_{X,w}$.
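
Combining the two sketches above, the event signature is then just the per-window count:

def event_signature(data_point, event, window_size):
    # E_{X,w}: count of occurrences of `event` in each time window.
    groups = window_groups(filter_event(data_point, event), window_size)
    return [len(g) for g in groups]

event_signature(x, "A", 1.0)  # [1, 0, 2, 1]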

Now define the similarity measure between two data points $v_1$ and $v_2$ with respect to the event $X$ as $\|E_{X,w}(v_1) - E_{X,w}(v_2)\|$. This could be an $L^1$ norm or an $L^2$ norm; see what works for you. (If the two signatures have different numbers of windows, zero-pad the shorter one first; an empty window simply contributes a count of 0.)

Note that the $L^p$ norms involve a sum over the components of the vector, so you should scale by the dimension of the vectors to normalize.
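
A sketch of this per-event distance, assuming the signatures from the previous sketch; zero-padding and the scaling by dimension are handled explicitly:

import numpy as np

def signature_distance(sig1, sig2, order=1):
    # L^1 (order=1) or L^2 (order=2) distance, scaled by the common dimension.
    n = max(len(sig1), len(sig2), 1)
    a = np.zeros(n)
    b = np.zeros(n)
    a[:len(sig1)] = sig1  # zero-pad the shorter signature
    b[:len(sig2)] = sig2
    return np.linalg.norm(a - b, ord=order) / n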

So now for every event $X$ you have a similarity measure $S_X$. To get a global measure you can just add them up:

$S(v_1,v_2) = \sum_X S_X(v_1,v_2)$

(I'm assuming there are no similarities between the events themselves, so a straight sum is appropriate. I'm also assuming that you have a fixed number of events $X$; if not, then you may want to scale by that number to normalize.)
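
Putting the helpers above together, the global measure could look like this (events being the full event vocabulary; the names are again mine):

def global_distance(dp1, dp2, events, window_size=1.0, order=1):
    # Sum the per-event signature distances over the whole event vocabulary.
    return sum(
        signature_distance(
            event_signature(dp1, e, window_size),
            event_signature(dp2, e, window_size),
            order=order)
        for e in events)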

You need to choose a window size which provides the right degree of separability between closely occurring events. You should take into account your measurement accuracy.

If you want to get fancy you can do different types of windowing. For example, instead of counting the number of events within a time window, you could ask how long it takes to get up to a fixed number of events within a count-window. Play around and see what fits your data.

Finally, now that you have a real-valued similarity measure, you can use $K$-means or whatever other methods you already know of.
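
As a rough sketch of how this could feed into clustering (my own illustration, not part of the method above): since $K$-means works on fixed-length vectors rather than a pairwise measure, one option is to concatenate the per-event signatures, padded to a common number of windows, and cluster the resulting matrix, e.g. with scikit-learn's KMeans. Alternatively, compute the full pairwise matrix of $S$ and use a method that accepts precomputed distances.

import numpy as np
from sklearn.cluster import KMeans

def feature_matrix(data_points, events, window_size=1.0, max_time=100.0):
    # Stack the per-event signatures, padded/truncated to a fixed number of
    # windows, into one row per data point so K-means can be applied directly.
    n_windows = int(max_time // window_size) + 1
    rows = []
    for dp in data_points:
        row = []
        for e in events:
            sig = event_signature(dp, e, window_size)
            row.extend((sig + [0] * n_windows)[:n_windows])
        rows.append(row)
    return np.asarray(rows, dtype=float)

# data_points = [X1, ..., X7] from the question; events = the full vocabulary
# features = feature_matrix(data_points, events, window_size=1.0)
# labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

With 1500 events and times up to 100, these rows get long (roughly 150,000 dimensions at a window size of 1.0), so in practice a coarser window or a sparse representation may be needed.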
