MATLAB: KMEANS delivers different results on the same data set

kmeans

I'm performing a cluster analysis on financial time series. The distance measure is correlation.

IDX = kmeans(data',2,'distance','correlation')

The call above returns different cluster assignments from run to run on the same set of time series. I’m wondering how this is possible.

Thanks for your help!

Best Answer

Christian, the kmeans function uses a randomly chosen starting configuration:

>> help kmeans
 kmeans K-means clustering.
[snip]
    'Start' - Method used to choose initial cluster centroid positions,
       sometimes known as "seeds".  Choices are:
           'sample'  - Select K observations from X at random (the default)
           'uniform' - Select K points uniformly at random from the range
                       of X.  Not valid for Hamming distance.
           'cluster' - Perform preliminary clustering phase on random 10%
                       subsample of X.  This preliminary phase is itself
                       initialized using 'sample'.
            matrix   - A K-by-P matrix of starting locations.  In this case,
                       you can pass in [] for K, and kmeans infers K from
                       the first dimension of the matrix.  You can also
                       supply a 3D array, implying a value for 'Replicates'
                       from the array's third dimension.
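
If you need the same result on every run, two options follow from the help text above: reset the random number generator to a fixed seed before each call, or pass your own starting centroids through 'Start'. The snippet below is only a sketch. The random matrix `data` is a hypothetical stand-in for your time series, and the seed values and the choice of the first two series as starting centroids are arbitrary assumptions.

```matlab
% Hypothetical stand-in for the real data: 200 time points, 6 series.
rng(0);                      % seed used only to build the example data
data = randn(200, 6);
X = data';                   % one series per row, as in kmeans(data', ...)

% Option 1: reset the generator to the same seed before each call,
% so that kmeans draws the same starting centroids both times.
rng(42);
idx1 = kmeans(X, 2, 'Distance', 'correlation');
rng(42);
idx2 = kmeans(X, 2, 'Distance', 'correlation');
sameWithSeed = isequal(idx1, idx2);

% Option 2: supply a K-by-P matrix of starting centroids (here K = 2,
% taken from the first two series) and pass [] for K.
startCentroids = X(1:2, :);
idx3 = kmeans(X, [], 'Distance', 'correlation', 'Start', startCentroids);
```

Fixing the seed keeps the random choice of start but makes it repeatable. An explicit 'Start' matrix removes the random choice of starting points altogether, so the result then depends on how good those starting points are.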

Like many optimizations, the k-means algorithm can converge to different solutions (local minima) from different starting points. You can take advantage of the randomness built into the kmeans function by running several replicates from different starting points; kmeans then returns the solution with the lowest total sum of point-to-centroid distances:

    'Replicates' - Number of times to repeat the clustering, each with a
       new set of initial centroids.  A positive integer, default is 1.
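
As a rough sketch of how that might look with the correlation distance from the question (the random matrix `X` is a hypothetical stand-in for `data'`, and 20 replicates is an arbitrary choice):

```matlab
% Hypothetical stand-in for data': 6 series (rows), 200 time points each.
rng(0);                      % seed used only to build the example data
X = randn(6, 200);

% Run 20 replicates, each from a new set of initial centroids.
% kmeans keeps the replicate with the lowest total sum of distances.
[idx, C, sumd] = kmeans(X, 2, 'Distance', 'correlation', ...
                        'Replicates', 20);

% Total within-cluster sum of point-to-centroid distances of that solution.
totalDist = sum(sumd);
```

With enough replicates, repeated calls tend to settle on the same best solution, although the cluster numbering (which group is labelled 1 and which 2) can still differ between calls.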

Hope this helps
