Solved – Clustering objects based on event timestamps

clustering, unsupervised learning

I have data for $n \approx 500$ objects, and for each object I have between ~50 and ~200 observations. Each observation consists primarily of a timestamp when an event happened (and includes some minor data about the event, but I don't think that's hugely important). I'm interested in clustering the $n$ objects into groups based on the patterns of when the events happened.  

A simple parallel example: You have 500 entrances to parking lots surrounding a stadium. At each entrance you record the time whenever a car enters the parking lot (and some minor auxiliary data like how much the car weighs). There are some common patterns they all share, particularly that very few cars arrive after the event starts. There are also different patterns, e.g. one entrance may have lots of cars spread out through time, while another may have only a small number at the last minute. How can you cluster the data such that entrances with similar arrival patterns are grouped together?

Here's a visual example. x-axis is time. y-axis is arbitrary sorting of the objects, such that each object is a row. I drew some colors indicating possible clusterings, but my selections were fairly arbitrary.

[Image: visual example]

Best Answer

This question about time series clustering is similar. Essentially, your question boils down to determining distances or (dis)similarities between series of timestamps. (Once you have distances, you can use any clustering algorithm - I personally like DBSCAN.)

A couple of possibilities come to mind, depending on whether the number of events should have a higher impact than the timing or vice versa. Should two series of timestamps be "similar" if they both have 20 events, but at wildly different times... or if one has 20 events and the other 5, but these 5 events occur at exactly the same time as 5 out of the 20 in the first series?

You could bucket your timestamps into short time intervals (e.g., two-minute buckets for your car example) and count how many timestamps fall into each bucket. This gives one integer time series per object, and you can then calculate correlations between these series, or Hellinger distances. Depending on how you answer the "number vs. timing" question above, you may or may not first want to normalize each time series by its total number of events. (Hellinger distances are defined between probability distributions, so they presuppose this normalization.) Or, if you want to include additional time-dependent information like car weight in your example, you could sum the weights in each time bucket instead of counting the events.
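As a rough illustration of the bucketing idea, here is a minimal NumPy sketch. Everything specific in it is an assumption for the sake of the example: the function names `bucket_counts` and `hellinger_matrix`, the two-minute bucket width, the 0 to 10 minute window, and the three toy entrances are all made up, and the series are normalized by their event totals (i.e., timing matters more than volume).

```python
import numpy as np

def bucket_counts(timestamps, start, end, width, weights=None):
    """Count events (or sum their weights) in fixed-width time buckets."""
    n_buckets = int(np.ceil((end - start) / width))
    edges = start + width * np.arange(n_buckets + 1)
    counts, _ = np.histogram(timestamps, bins=edges, weights=weights)
    return counts.astype(float)

def hellinger_matrix(series):
    """Pairwise Hellinger distances between rows, each normalized to sum to 1."""
    series = np.asarray(series, dtype=float)
    totals = series.sum(axis=1, keepdims=True)
    # Guard against objects with no events at all (hypothetical edge case).
    probs = np.divide(series, totals, out=np.zeros_like(series), where=totals > 0)
    roots = np.sqrt(probs)
    diff = roots[:, None, :] - roots[None, :, :]
    return np.sqrt(0.5 * (diff ** 2).sum(axis=2))

# Toy data: arrival times in minutes at three hypothetical entrances.
entrances = [
    [1, 3, 5, 7, 9],                                     # spread out
    [1.2, 1.5, 3.2, 3.5, 5.2, 5.5, 7.2, 7.5, 9.2, 9.5],  # same shape, twice the volume
    [8.1, 8.5, 9.0, 9.5, 9.9],                           # last-minute arrivals
]
series = [bucket_counts(t, start=0, end=10, width=2) for t in entrances]
D = hellinger_matrix(series)
print(np.round(D, 3))
```

With normalization, the first two entrances come out at distance zero because they differ only in volume, while the last-minute entrance is far from both. Dropping the normalization (and switching to a distance that does not require probability distributions) would make volume count as well. A matrix like `D` can then be handed to any clustering algorithm that accepts precomputed distances; scikit-learn's `DBSCAN`, for instance, takes `metric="precomputed"`.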
