Solved – How to find the number of clusters in 1d data and the mean of each

Tags: algorithms, clustering, distributions

We have a list of prices and need to find both the number of clusters (or intervals) and the mean price of each cluster (or interval). The only constraint is that we want the cluster means to be at least a distance X from one another.

K-means doesn't seem to work because it requires specifying the number of clusters as input.

The reason for finding these is that prices that form a "significant" cluster, one with many data points, serve as support and resistance levels for trading. Currently this is done by a human visually inspecting clusters of prices on a chart, but the purpose here is to quantify it in an algorithm to make it more objective and measurable.

Best Answer

Don't run clustering (such as k-means) on 1-dimensional data.

Why: 1-dimensional data can be sorted. Algorithms that exploit sorting are much more efficient than algorithms that do not exploit this.

Look at classic statistics

And forget about buzzwords such as "data mining" and "clustering"!

For your task, I recommend you use kernel density estimation. This is a well-proven technique from statistics, and very flexible. To cluster your data, look for the local maxima and minima of the density estimate: the minima are where you split the data, and each maximum marks the center of a cluster. It's fast, and has a much stronger theoretical background than cluster analysis.
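As a rough illustration of the idea, here is a minimal sketch in plain Python that evaluates a Gaussian kernel density estimate on a grid, cuts the sorted data at the local minima of the density, and reports the mean and size of each resulting group. The function name `kde_clusters`, the `bandwidth` and `grid_size` parameters, and the sample prices are all made up for this example; they are not from the question. The bandwidth is the main knob: a larger value merges nearby peaks, so it plays a role loosely similar to the minimum distance X, but it is not the same constraint.

```python
import math

def kde_clusters(prices, bandwidth, grid_size=512):
    """Split 1-D data at local minima of a Gaussian kernel density estimate.

    Returns one (mean, count) pair per cluster, in ascending price order.
    `bandwidth` and `grid_size` are illustrative tuning parameters.
    """
    data = sorted(prices)
    lo, hi = data[0] - 3 * bandwidth, data[-1] + 3 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]

    # Unnormalised Gaussian KDE on the grid; the constant factor is dropped
    # because it does not change where the extrema are.
    density = [
        sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in data)
        for x in grid
    ]

    # Local minima of the density are the cut points between clusters.
    cuts = [
        grid[i]
        for i in range(1, grid_size - 1)
        if density[i] < density[i - 1] and density[i] <= density[i + 1]
    ]

    # Walk the sorted data once and start a new cluster at every cut.
    clusters, current = [], []
    cut_iter = iter(cuts + [math.inf])
    next_cut = next(cut_iter)
    for p in data:
        while p > next_cut:
            if current:
                clusters.append(current)
                current = []
            next_cut = next(cut_iter)
        current.append(p)
    if current:
        clusters.append(current)

    return [(sum(c) / len(c), len(c)) for c in clusters]

# Made-up prices with three visible groups near 10, 20 and 30.
sample_prices = [10.0, 10.2, 9.9, 10.1, 20.0, 20.3, 19.8, 30.1, 29.9]
levels = kde_clusters(sample_prices, bandwidth=1.0)
```

Each entry of `levels` is a (mean price, number of points) pair, so the count can be used to decide which levels are "significant". Note that the number of clusters is an output here, not an input, which is what the question asked for.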

When to use cluster analysis

Essentially, use cluster analysis when your data is so large and complex that you can no longer use classic statistical modeling: when you have too many variables and processes too complex to model, when density estimation no longer works, when you can no longer visualize the data.

Even in 2d data, don't do cluster analysis. Visualize your data, and manually mark your clusters. Methods such as k-means will produce a k-cluster result no matter what, even when there are no clusters in your data set, because they blindly optimize a mathematical objective without reality-checking the result. If you manually cluster your data, your results will be much more meaningful.
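To see the "k clusters no matter what" behavior for yourself, here is a small sketch of plain Lloyd-style k-means on 1-dimensional data, run on evenly spaced values that have no cluster structure at all. The function `kmeans_1d`, its deterministic initialization, and the test data are assumptions made for this illustration, not a reference implementation of any library.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd's algorithm on 1-D data; returns (centre, count) pairs."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: spread the initial centres over the sorted data.
    centres = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centres[j]))
            groups[nearest].append(x)
        # Update step: move each centre to the mean of its group
        # (an empty group keeps its old centre).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:
            break
        centres = updated
    return [(c, len(g)) for c, g in zip(centres, groups)]

# Evenly spaced values: there is nothing to cluster here.
flat = [float(i) for i in range(100)]
result = kmeans_1d(flat, k=3)
```

The algorithm still returns three "clusters" that simply cut the range into three pieces; nothing in the output tells you that the structure is not real.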
