MATLAB: How to avoid uncertainty in processing result of MATLAB Statistics Toolbox

uncertainty of processing result

I’m annoyed by the run-to-run variation in the results of my MATLAB program. My code is as follows.
%—————————–
clear all; close all;
a = [0.3948 0.4644 0.4412 0.6270 0.6270 0.1626]'; % column vector: kmeans treats each row as one observation
[idx, c] = kmeans(a,2)
rate = c(1)/c(2)
%—————————–
I ran this program several times and found the results quite interesting. Although the input data set is fixed, the results can differ from run to run. I found at least four distinct results.
%—————————–
idx = 1 1 1 2 2 1 c = 0.3658 0.6270 rate = 0.5833
idx = 1 1 1 1 1 2 c = 0.5109 0.1626 rate = 3.1419
idx = 2 2 2 1 1 2 c = 0.6270 0.3658 rate = 1.7143
idx = 2 2 2 2 2 1 c = 0.1626 0.5109 rate = 0.3183
%—————————–
Can anybody help me avoid this uncertainty? BTW, my MATLAB version is R2008a.
Thank you in advance for any response.
Best regards,
Jean

Best Answer

This is expected behavior because KMEANS by default selects the initial cluster centroid positions at random (albeit from the observations). That is, the value of the 'start' parameter is set to 'sample' as can be seen from the documentation. Another outcome you would also observe if you run your code several times is that KMEANS errors out because an empty cluster is created at the first iteration (i.e., idx is all 1's or all 2's). You could always pass a matrix of initial positions as the value for the 'start' parameter, for example:
[idx c] = kmeans(a,2,'start',[0 0.5]')
This yields the same result every time, but because the partition returned by KMEANS depends heavily on the initial centroid positions, you would probably get a sub-optimal partition (unless you provide a "lucky" vector for the 'start' parameter). The typical use of KMEANS entails setting the 'Replicates' parameter to an integer n, the number of times to repeat the clustering. KMEANS then returns the partition with the lowest sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances.
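As a rough sketch of the 'Replicates' approach applied to the data from the question: the replicate count of 20 is an arbitrary choice, the names c_sorted and total_cost are illustrative, and the 'emptyaction' option is assumed to be available in your release (check the KMEANS documentation for your version). The centroids are sorted before taking the ratio because the cluster labels 1 and 2 are arbitrary. Two of the four results in the question are the same partitions as the other two with the labels swapped.

```matlab
% Data from the question, as a column vector (one observation per row).
a = [0.3948 0.4644 0.4412 0.6270 0.6270 0.1626]';

% Repeat the clustering 20 times from random starts and keep the best partition.
% 'emptyaction','singleton' (assumed available) avoids the empty-cluster error
% that can occur when both initial centroids land on the duplicated value 0.6270.
[idx, c, sumd] = kmeans(a, 2, 'replicates', 20, 'emptyaction', 'singleton');

% Labels are arbitrary, so order the centroids before forming the ratio;
% otherwise both rate and 1/rate can appear for the same partition.
c_sorted = sort(c);
rate = c_sorted(1) / c_sorted(2);

% Total within-cluster cost (squared Euclidean distance by default).
total_cost = sum(sumd);
```

Working the costs out by hand for this data, the partition that isolates 0.1626 has a total cost of about 0.0475, while the one that isolates the two 0.6270 values costs about 0.0575. With enough replicates the former should therefore be the one returned, although each replicate still starts from a random sample, so this makes a different outcome unlikely rather than impossible.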