Step 1: Filter
For each event $X$, define a filter $F_X$ on data points which keeps only the elements labelled $X$ and sorts their timestamps. The output of this filter is a vector of sorted non-negative reals.
Example: on the data point
[(A,0), (B,1), (C,1), (A,2.2), (B,2.2), (A,2.5), (C,2.7), (A,3.3)]
the filter $F_A$ would yield
[0, 2.2, 2.5, 3.3]
and the filter $F_B$ would yield
[1, 2.2]
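The filter step might be sketched like this in Python (the function name `filter_event` is my own; the sample data is the one from the example above):

```python
def filter_event(data_point, event):
    """Keep only the timestamps of the given event, sorted ascending."""
    return sorted(t for (e, t) in data_point if e == event)

point = [("A", 0), ("B", 1), ("C", 1), ("A", 2.2),
         ("B", 2.2), ("A", 2.5), ("C", 2.7), ("A", 3.3)]

print(filter_event(point, "A"))  # [0, 2.2, 2.5, 3.3]
print(filter_event(point, "B"))  # [1, 2.2]
```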
Step 2: Window
Next, select a positive real "window size" and partition the time axis into a sequence of right half-open intervals of this size. For example, if your size were 1.0, your windows would be the half-open intervals:
[0,1.0), [1.0,2.0), [2.0,3.0), ...
Now, from the output of $F_X$, group the elements which occurred in the same window. So if your window size is 1.0, then the output of $F_A$ from above,
[0, 2.2, 2.5, 3.3]
would be grouped as
([0], [], [2.2,2.5], [3.3])
Step 3: Aggregate
Now perform a COUNT over these groups; continuing the example we obtain
(1, 0, 2, 1)
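The windowing and aggregation steps together turn a filtered timestamp list into the count vector. A minimal sketch, assuming all timestamps fall below a fixed `horizon` (a parameter I've added so that all signatures come out the same length):

```python
import math

def event_signature(timestamps, window, horizon):
    """Count events per right half-open window [k*w, (k+1)*w).

    Assumes every timestamp t satisfies 0 <= t < horizon.
    """
    n_windows = math.ceil(horizon / window)
    counts = [0] * n_windows
    for t in timestamps:
        counts[int(t // window)] += 1
    return counts

print(event_signature([0, 2.2, 2.5, 3.3], 1.0, 4.0))  # [1, 0, 2, 1]
```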
Let's denote this event signature of the data point with respect to the event $X$ and the chosen window size $w$ by $E_{X,w}$.
Now define the similarity measure between two data points $v_1$ and $v_2$ with respect to the event $X$ as $S_X(v_1, v_2) = \|E_{X,w}(v_1) - E_{X,w}(v_2)\|$. The norm could be $L^1$ or $L^2$; see what works for you.
Note that the $L^p$ norms involve a sum over the components of the vector, so you should scale by the dimension of the vectors to normalize.
So now for every event $X$ you have a similarity measure $S_X$. To get a global measure you can just add them up:
$S(v_1,v_2) = \sum_X S_X(v_1,v_2)$
(I'm assuming there are no similarities between these events, so a straight sum is appropriate. I'm also assuming that you have a fixed set of events $X$; if not, you may want to scale by their number to normalize.)
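A sketch of the summed distance, assuming the per-event signatures $E_{X,w}$ have already been computed with a common window size; the dictionaries below hold illustrative values, and the $L^1$ norm is used, scaled by dimension as noted above:

```python
def l1_distance(sig1, sig2):
    """L1 distance between equal-length signatures, scaled by dimension."""
    return sum(abs(a - b) for a, b in zip(sig1, sig2)) / len(sig1)

def total_distance(sigs1, sigs2):
    """Sum the per-event measures S_X over all events X."""
    return sum(l1_distance(sigs1[x], sigs2[x]) for x in sigs1)

# Illustrative precomputed signatures, keyed by event label.
v1 = {"A": [1, 0, 2, 1], "B": [0, 1, 1, 0]}
v2 = {"A": [1, 1, 1, 1], "B": [0, 0, 2, 0]}
print(total_distance(v1, v2))  # 1.0
```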
You need to choose a window size which provides the right degree of separability between closely occurring events. You should take into account your measurement accuracy.
If you want to get fancy, you can do different types of windowing. For example, instead of counting the number of events within a time window, you could ask how long it takes to accumulate a fixed number of events within a count-window. Play around and see what fits your data.
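The count-window variant could be sketched as follows: rather than counting events per time window, measure the time spanned by each consecutive group of $k$ events (the value of $k$ and the timestamps below are illustrative):

```python
def count_window_signature(timestamps, k):
    """Elapsed time spanned by each consecutive group of k events."""
    ts = sorted(timestamps)
    return [ts[i + k - 1] - ts[i] for i in range(0, len(ts) - k + 1, k)]

# Two count-windows of k=2 events each: [0, 1.5] and [2.0, 3.5].
print(count_window_signature([0, 1.5, 2.0, 3.5], 2))  # [1.5, 1.5]
```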
Finally, now that you have a real-valued similarity measure you can use $K$-means or whatever other methods you already know of.
Best Answer
This question about time series clustering is similar. Essentially, your question boils down to determining distances or (dis)similarities between series of timestamps. (Once you have distances, you can use any clustering algorithm - I personally like DBSCAN.)
A couple of possibilities come to mind, depending on whether the number of events should have a higher impact than the timing or vice versa. Should two series of timestamps be "similar" if they both have 20 events, but at wildly different times... or if one has 20 events and the other 5, but these 5 events occur at exactly the same time as 5 out of the 20 in the first series?
You could bucketize your timestamps into smaller time intervals (e.g., two-minute buckets for your car example), count how many timestamps fall into each bucket to get an integer time series, and then calculate correlations over time or Hellinger distances. Depending on how you answer the "number vs. timing" question above, you may or may not want to first normalize each time series by the total number of events. Or, if you want to include additional time-dependent information like car weight in your example, you could sum the weights in each time bucket instead of counting the number of events.
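The bucketize-then-compare idea can be sketched as follows; normalizing each bucketized series by its total event count turns it into a discrete distribution, which the Hellinger distance then compares (the bucket size, bucket count, and timestamps are illustrative):

```python
import math

def bucketize(timestamps, bucket, n_buckets):
    """Count how many timestamps fall into each half-open bucket."""
    counts = [0] * n_buckets
    for t in timestamps:
        counts[int(t // bucket)] += 1
    return counts

def normalize(counts):
    """Turn a count vector into a discrete probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (in [0, 1])."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

p = normalize(bucketize([0.5, 1.1, 1.7, 3.2], bucket=1.0, n_buckets=4))
q = normalize(bucketize([0.4, 1.3, 2.8, 3.5], bucket=1.0, n_buckets=4))
print(hellinger(p, q))
```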