Solved – Calculating similarity and clustering question

clustering, distance-functions, r, similarities

I have a dataset of about a million companies containing their names, total employees, and annual sales. I want to come up with a function that, given a company, returns the 5 most similar companies in terms of distance in total employees and annual sales.

I thought of doing k-means clustering on the dataset to find clusters, and then returning all the companies in the given company's cluster. The problem with this approach is that I don't know beforehand how many clusters I should form.

Also, on a separate note, if I were to obtain a list of specialties (e.g. marketing, software, etc.) for each of the companies, how can I transform this qualitative value into a number that can later help me calculate similarity?

Best Answer

1) Regarding the total employees and annual sales features: as both are quantitative, you may simply use Euclidean distance for them. Just don't forget to z-standardize both variables first, since they are measured in different units. After standardizing, you may indeed try K-means clustering. This method of cluster analysis is implicitly based on Euclidean distances between the objects (the companies), so you don't have to compute the pairwise distances, which matters when you have a million companies!
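To make the standardize-then-measure step concrete, here is a minimal Python sketch (Python rather than R is my choice, and it uses only the standard library). The `(name, employees, sales)` tuple layout and the function names `zscores` and `most_similar` are assumptions made for illustration. It answers the "5 most similar" query directly with one linear scan per query, so no pairwise distance matrix is ever built.

```python
import heapq
import math
from statistics import mean, pstdev


def zscores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant variable carries no information about distance.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def most_similar(companies, name, k=5):
    """Return the names of the k companies closest to `name`.

    `companies` is a list of (name, employees, sales) tuples; this
    layout is a hypothetical one chosen for the example.
    """
    names = [c[0] for c in companies]
    emp = zscores([c[1] for c in companies])
    sales = zscores([c[2] for c in companies])
    i = names.index(name)
    target = (emp[i], sales[i])
    # Euclidean distance in the standardized space, skipping the query itself.
    candidates = (
        (math.dist(target, (emp[j], sales[j])), names[j])
        for j in range(len(companies))
        if j != i
    )
    return [n for _, n in heapq.nsmallest(k, candidates)]
```

For example, `most_similar(companies, "Acme", k=5)` would return five names, nearest first, where "Acme" stands for any name present in the data. In practice the z-scores would be computed once and reused across queries rather than recomputed on every call.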

2) You think you can't decide on the number of clusters beforehand, but you can always decide afterwards. Run K-means properly a number of times, specifying a different number of clusters each time, say, from 20 down to 2, and save the result (the cluster membership variable) each time. Then compare the quality of these 19 solutions by some internal clustering criterion (I'd recommend Calinski–Harabasz or Davies–Bouldin, both based on ANOVA ideology). I'm not an R user and cannot recommend a tested package, but try something like NbClust. There are also other ways to determine the "right" number of clusters with K-means, for example "cross-validation" by subsamples. Read up carefully on this topic before you apply it.
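The procedure in point 2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a tested recipe: the function names are mine, the K-means is a bare Lloyd's algorithm with a deterministic farthest-first start (chosen only so the example is reproducible; real runs would normally use several random or k-means++ restarts), and the input points are assumed to be z-standardized already. The Calinski–Harabasz index is computed as the between-cluster sum of squares over (k - 1), divided by the within-cluster sum of squares over (n - k).

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Assumption: deterministic farthest-first initialization.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = centroid(members)
    return labels


def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: higher means better-separated clusters."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    n, k = len(points), len(groups)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")
    grand = centroid(points)
    between = within = 0.0
    for members in groups.values():
        c = centroid(members)
        between += len(members) * math.dist(c, grand) ** 2
        within += sum(math.dist(p, c) ** 2 for p in members)
    if within == 0:
        return math.inf
    return (between / (k - 1)) / (within / (n - k))


def best_k(points, k_values):
    """Score every k and return (k with the highest index, all scores)."""
    scores = {k: calinski_harabasz(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores
```

Calling `best_k(points, range(2, 21))` mirrors the "from 20 down to 2" comparison. Pure Python will be slow on a million rows, so a sketch like this is better suited to a random subsample or to checking one's understanding of the criterion.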

3) Regarding nominal variables such as specialties (marketing, software, etc.):

  • One way is to recode the variable into a series of dummy (1 vs 0) variables and still perform K-means clustering on those as if they were quantitative variables. This approach is not valid geometrically or logically, but heuristically it can be used, and indeed is used by many. The proper way with dummies would be to compute the Dice similarity measure (other similarity measures for binary features are permissible, too) and cluster by some appropriate method (not K-means); however, the problem in your case is that you have too many objects. It is impossible to build such a huge similarity matrix at once.
  • In SPSS, there is a two-step clustering procedure that can cluster a huge number of objects and allows nominal as well as quantitative variables. I believe that method would be a good choice for you. It is just a slightly modified BIRCH clustering method. I don't know whether "two-step clustering" is implemented in R, but I believe BIRCH should be. I don't know whether BIRCH can take nominal variables.
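As an illustration of the dummy-coding idea in the first bullet, the sketch below turns each company's set of specialties into a 0/1 vector and computes the Dice similarity between two such vectors. The function names and the example vocabulary are hypothetical, and returning 0.0 when neither company has any specialty is a convention I chose, not part of the measure's definition.

```python
def dummy_code(specialties, vocabulary):
    """Return a 0/1 vector with one position per specialty in `vocabulary`."""
    return [1 if s in specialties else 0 for s in vocabulary]


def dice_similarity(a, b):
    """Dice coefficient for two 0/1 vectors of equal length.

    Computed as 2 * (positions where both are 1) divided by
    (number of 1s in a + number of 1s in b).
    """
    both = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    # Assumption: two companies with no specialties at all score 0.0.
    return 2 * both / total if total else 0.0


# Hypothetical vocabulary and companies, for illustration only.
vocabulary = ["marketing", "software", "finance"]
company_a = dummy_code({"marketing", "software"}, vocabulary)  # [1, 1, 0]
company_b = dummy_code({"software"}, vocabulary)               # [0, 1, 0]
company_c = dummy_code({"finance"}, vocabulary)                # [0, 0, 1]
```

Here `dice_similarity(company_a, company_b)` is 2/3, because the two companies share one specialty out of three listed in total, while `dice_similarity(company_a, company_c)` is 0. Computing this for one query company against all others is a single pass over the data; it is only the full company-by-company matrix that becomes infeasible at a million rows.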