Solved – Pros and cons of clustering algorithms

clustering

So I am somewhat new to the realm of applied statistics and machine learning and am currently trying to figure out how to approach a problem I'm working on. Let me first describe the data that I have.

I am collecting data from a device during user interaction. There is a discrete set of possible events that can happen on this system, and the data will be labeled with (1) what event occurred and (2) what user was using the device. For each event, I will extract features from data gathered on the device during that event. For instance, let's say that the device is a machine running Linux, the user is Bob, and the event is issuing the 'cd' command via the command line. When this event occurs I log features such as CPU usage and system time.

So, I have a number of events, each of which is labeled with a user and an event descriptor. Each of these events also has N feature values associated with it. I want to cluster per user and event (meaning that I want to run a clustering algorithm for Bob across all the times that he has used the 'cd' command, using the extracted feature values). I need the output of this clustering to be a number of descriptors that represent the clusters in N-dimensional space (I was thinking perhaps a centroid and radius per cluster).
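To make the desired output concrete, here is a minimal sketch of the grouping step and of a "centroid + radius" descriptor. The function names, the record layout `(user, event, features)`, and the sample values are all made up for illustration, and the radius is assumed to be the largest distance from the centroid to any member. The sketch does not pick a clustering algorithm: it only summarizes one set of points that is already considered a cluster.

```python
from collections import defaultdict
from math import dist, fsum


def group_by_user_event(records):
    """Collect feature vectors per (user, event) pair.

    `records` is an iterable of (user, event, feature_vector) tuples;
    this layout is an assumption, not something fixed by the problem.
    """
    groups = defaultdict(list)
    for user, event, features in records:
        groups[(user, event)].append(tuple(features))
    return dict(groups)


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(fsum(column) / n for column in zip(*points))


def describe_cluster(points):
    """Return (centroid, radius), with radius = max distance to the centroid."""
    c = centroid(points)
    radius = max(dist(c, p) for p in points)
    return c, radius


# Hypothetical log entries: (user, event, (cpu_usage, system_time)).
records = [
    ("bob", "cd", (0.10, 2.0)),
    ("bob", "cd", (0.30, 4.0)),
    ("bob", "ls", (0.50, 1.0)),
    ("alice", "cd", (0.20, 3.0)),
]

groups = group_by_user_event(records)
c, r = describe_cluster(groups[("bob", "cd")])
print(c, r)
```

With real data, `describe_cluster` would be applied to each cluster found inside a group rather than to the whole group. Features on very different scales (CPU fraction vs. seconds) would probably need rescaling before a Euclidean radius means much.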

So, hopefully that was a sufficient explanation of what I'm working with. Given that I'm a bit of a n00b with this stuff, I'd really appreciate any advice with the following questions:

  1. What clustering algorithms would be well-suited to this type of data and what are their pros/cons?

  2. What are (generally) the variables in clustering algorithms that can be used to fine-tune your model?

  3. Are there a number of standardized ways to succinctly describe clusters other than centroid + radius?

  4. Are there tools outside of Weka that you all would recommend for running this sort of analysis?

  5. Are there any inherent flaws to my approach?

Lastly, if any of you know of any good resources that could be considered "crash courses" in clustering algorithms, I'd love to hear about them!

Thanks in advance!

Best Answer

Well, Weka does not have very many clustering algorithms. It's a classification and machine learning tool, not so much a general-purpose data mining tool (I would not consider "clustering" and "data mining" to be part of "machine learning": they don't learn!). There is, for example, ELKI, which has a lot more clustering and outlier detection methods.

However, most of these algorithms are designed for continuous values. Clustering is usually a structure discovery approach. (You might call k-means a partition optimization approach instead: it does not really care about structure, it just minimizes the within-partition sum of squares.)
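As a rough illustration of what "within-partition sum of squares" means, here is a small sketch that computes that quantity for a given partitioning and then applies one reassignment step in the style of Lloyd's algorithm (move every point to its nearest current centroid). The helper names and the toy points are invented for the example, and this is not a full k-means implementation: there is no initialization strategy and no loop until convergence.

```python
from math import dist, fsum


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(fsum(column) / n for column in zip(*points))


def within_ss(partitions):
    """Sum, over all partitions, of squared distances to the partition mean."""
    total = 0.0
    for points in partitions:
        c = centroid(points)
        total += fsum(dist(c, p) ** 2 for p in points)
    return total


def reassign(partitions):
    """One Lloyd-style step: assign each point to the nearest current centroid."""
    centers = [centroid(points) for points in partitions]
    new_partitions = [[] for _ in centers]
    for points in partitions:
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(centers[i], p))
            new_partitions[nearest].append(p)
    # Drop partitions that lost all their points.
    return [points for points in new_partitions if points]


# Toy data: the point (10, 10) starts out in the "wrong" partition.
points_a = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
points_b = [(10.0, 11.0), (11.0, 10.0)]

before = within_ss([points_a, points_b])
after = within_ss(reassign([points_a, points_b]))
print(before, after)
```

The reassignment lowers the objective here because the misplaced point moves to the partition whose mean is closer. Nothing in the objective looks at density or shape, which is the sense in which k-means optimizes a partition rather than discovering structure.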

In your use case, I do not think clustering is what you are looking for. You are interested in aggregation and grouping, but I wouldn't do this at the level of numeric vectors.

For your use case, you might be much better off e.g. with decision trees or frequent itemset mining and consider the "branches" and "sets" of these algorithms your "clusters".

But it's hard to give you advice if you do not state a particular aim that you are trying to achieve. If you are just trying to do something, then anything will do.