I have a dataset of about a million companies containing their names, total employees and annual sales. I want to come up with a function that, given a company, returns the 5 most similar companies in terms of their distance in total employees and annual sales.
I thought of running k-means clustering on the dataset to find clusters, and then returning all the companies in the given company's cluster. The problem with this approach is that I don't know beforehand how many clusters I should form.
Also, on a separate note, if I were to obtain a list of specialties (e.g. marketing, software, etc.) for each of the companies, how could I transform this qualitative value into a number that can later help me calculate similarity?
Best Answer
1) Regarding the total employees and annual sales features: both are quantitative, so you can simply use Euclidean distance for them. Just don't forget to z-standardize both variables first, since they are measured in different units. After standardizing, you can indeed try K-means clustering. This method of cluster analysis is implicitly based on Euclidean distances between the objects (the companies), so you don't have to compute the pairwise distances yourself, which matters when you have a million companies!
2) You think you can't decide on the number of clusters beforehand, but you can always decide afterwards. Run K-means properly a number of times, specifying a different number of clusters each time, say from 20 down to 2, and save the result (the cluster membership variable) each time. Then compare the quality of these 19 solutions using some internal clustering criterion (I'd recommend Calinski–Harabasz or Davies–Bouldin, both based on ANOVA ideology). I'm not an R user and cannot recommend a tested package, but try something like NbClust. There are also other ways to determine the "right" number of clusters with K-means, for example "cross-validation" by subsamples. Read up on this topic carefully before you apply it.
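The standardize-then-cluster procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not a tested recipe: the data are made up, the helper names (`zscore`, `kmeans`, `calinski_harabasz`) are hypothetical, the K-means is a bare Lloyd's algorithm with a deterministic farthest-first start instead of the multiple random restarts you would want in practice, and the range of k is 6 down to 2 instead of 20 down to 2 because the toy dataset is tiny.

```python
from statistics import mean, pstdev

def zscore(column):
    """Rescale one variable to mean 0 and standard deviation 1."""
    mu, sd = mean(column), pstdev(column)
    return [(v - mu) / sd for v in column]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(mean(dim) for dim in zip(*points))

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns one cluster label per point.

    Assumption: centers start from a deterministic farthest-first pick,
    used here only to keep the sketch reproducible.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return labels

def calinski_harabasz(points, labels):
    """(between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k = len(points), len(clusters)
    grand = centroid(points)
    between = within = 0.0
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * sq_dist(center, grand)
        within += sum(sq_dist(p, center) for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Made-up (employees, annual sales) values: three size groups of nine companies each.
raw = [(emp + de, sales + ds)
       for emp, sales in [(100, 1e6), (500, 9e6), (900, 5e6)]
       for de in (-20, 0, 20)
       for ds in (-2e5, 0, 2e5)]

# Standardize each variable separately, then pair them back up per company.
points = list(zip(zscore([e for e, _ in raw]), zscore([s for _, s in raw])))

# Cluster for every k from 6 down to 2 and score each solution.
scores = {k: calinski_harabasz(points, kmeans(points, k)) for k in range(6, 1, -1)}
best_k = max(scores, key=scores.get)  # higher Calinski-Harabasz is better
print(best_k, round(scores[best_k], 1))
```

On a million rows a pure-Python loop like this would be far too slow; a library implementation (for example scikit-learn's `KMeans` together with `calinski_harabasz_score`) would be the more realistic choice, with the same standardize, cluster, score sequence.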
3) Regarding nominal variables such as marketing, software.