Solved – Calculating similarity and clustering question

clusteringdistance-functionsrsimilarities

I have a dataset of about a million companies containing their names, total employees and annual sales. I want to come up with a function that when given the company returns the 5 most similar companies in terms of their distance in total employees and annual sales.

I thought of doing k-means clustering on the dataset and find clusters. Then return all the companies in that cluster. The problem with this approach is that I don't know the number of clusters I should form beforehand.

Also, on a separate note, if I were to obtain a list of specialties (e.g. marketing, software, etc.) for each of the companies – how can I transform this qualitative value in to a number which can later help me calculate similarity.

Best Answer

1) Regarding total employees and annual sales features, as both are quantitative you may just use euclidean distance for them. Only don't forget to z-standardize both variables first, as soon as they are of different measure units. Upon the standardizing you indeed may try K-means clustering. This method of cluster analysis is implicitly based on euclidean distances between the objects (the companies); you don't have to compute the pairwise distances, especially as you have million of companies!

2) You think you can't decide on the number of clusters beforehand, but you could always do it afterwards. Do properly K-means a number of times, specifying different number of clusters each time, say, from 20 to 2, and save the results (cluster membership variable) each time. Then compare the quality of these 19 solutions by some internal clustering criteria (I'd recommend Calinski–Harabasz or Davies–Bouldin, both based on ANOVA ideology). I'm not R user and cannot recommend a tested package, but use something like NBClust. There are also other ways to determine the "right" number of clusters with K-means, for example "cross-validation" by subsamples. Carefully read something on this topic before you apply.

3) Regarding nominal variables such as marketing, software.

One way is to recode the variables into series of dummy (1 vs 0) variables and perform still K-means clustering on those as if it were quantitative variables. This approach is not valid geometrically and logically, but heuristically it can be used, and indeed is being used by many. The proper way with dummies would be to compute Dice similarity measure (other similarity measures for binary features are permissible, too) and do clustering by some appropriate method (not K-means); however the problem in your case is that you have too many objects. It is impossible to create at once such a huge similarity matrix.
In SPSS, there is two-step clustering procedure which can cluster huge number of objects and also allows nominal variables as well as quantitative. I believe that method to be a good choice for you. The method is just slightly modified BIRCH clustering method. I don't know if "two-step clustering" is implemented in R, but BIRCH should be implemented, I believe. I don't know if BIRCH can take nominal variables.

Related Solutions

Solved – Comparing hierarchical clustering dendrograms obtained by different distances & methods

To compare the similarity of two hierarchical (tree-like) structures, measures based on cophenetic correlation idea are used. But is it correct to perform comparison of dendrograms in order to select the "right" method or distance measure in hierarchical clustering?

There are some points - hidden snags - regarding hierarchical cluster analysis that I would hold quite important:

Never compare (in order to select the method giving stronger partition) dendrograms obtained by different agglomeration methods visually. It won't tell which method is "better" at that. Each method has its own "prototypical" tree look: the trees will differ consistently even when the data have no cluster structure or have random cluster structure. (And I don' think there exist a standardization or measure that would take off these intrinsic differences.). You may, however, compare dendrogram looks of results produced by the same method but different data. Maxim: direct, appearance comparing of dendrograms after different methods is unacceptable.
Do not decide on the number of clusters (i.e. where to cut the tree) by looking at the dendrogram of Ward method. In Ward, the tree shows the growth of the summative, not the averaged, coefficient of colligation; and the consequence is that since later clusters are bigger by the number of points, the later clusters look misleadingly "better" on the tree. To standardize Ward's dendrogramm appropriately, divide the coefficient growth at each step by the overall number of points in the two clusters being combined (such standardized Ward dendrogram, though, may be hard to implement graphically).$^1$ Maxim: choosing a cut level by contemplating a dendrogram appearance, while possible, is not the best method to select the partition, and for some methods may be misleading. It is recommended to rely on some formal internal clustering criterion instead (see also here).
Albeit no-one can forbid you "experimenting" with distance measures or agglomerative methods, it is better to select the distance and the method consciously, not blind trying. The distance should reflect the aspects of difference you are interested in, and the method - one must be aware - implies a specific archetype of a cluster (e.g. the metaphor of a Ward cluster is, I would say, type; cluster after complete linkage would be circle [by hobby or plot]; cluster after single linkage would be spectrum [chain]; cluster after centroid method would be proximity of platforms [politics]; an average linkage cluster is conceptually most undifferentiated and would be generally united class).
Some methods call for right distance measures and/or right type of data. Ward and centroid, for example, logically require (squared) euclidean distance - because these methods engage in computation of centroids in euclidean space. And computation of geometric centroids is incongruous with, for example, binary data; the data should be scale/continuous. Maxim: data/distance/method assumptions and correspondence is very important and not so easy question.
Preprocessing (such as centering, scaling and other forms of transformation of variables/features) prior computation of a distance matrix and doing the clustering is extremely important question, too. It can dramatically influence the results. Think over what preprocessing may help you and will make sense from the interpretation point of view. Also, never be shy to carefully inspect you data graphically before attempting to do cluster analysis.
Not all methods of agglomerative clustering can be equally seen as giving you hierarchical classification... on philosophical grounds. For example, centroid method do gives hierarchy in a sense, because cluster centre is an emergent and defining feature of a cluster as a whole, and merging clusters is driven by that feature. Complete linkage, on the other hand, "dismisses" both subclusters when it merges them - by virtue of distancing among individual objects of the two. Thus, complete linkage dendrogram is merely a history of collection and not a parent-child sort of taxonomy. Maxim: hierarchical agglomerative cluster analysis, generally, expects that you make a partition based on its result, rather than see the result as hierarchical taxonomy.
Hierarchical clustering is typical greedy algorithm that makes the best choice among alternatives appearing on each step in the hope to get close to optimal solution in the end. However, the "best" choice appearing on a high level step is likely to be poorer than global optimum theoretically possible on that step. The greater is the step, the greater is the suboptimality, as a rule. Given that we usually want few clusters last steps are important; and, as just said, they are expected to be relatively poor if the steps' number is high (say, thousandth step). That's why hierarchical clustering is generally not recommended for large samples of objects (numbering thousands of objects) even if the program could handle such a big distance matrix.

If after the above precautions you continue to think that you want a measure of similarity between hierarchical classifications you might google on 'comparing of dendrograms' and 'comparing hierarchical classifications'. One most suggesting itself idea may be based on the cophenetic correlation: having two dendrograms for the same dataset of n objects, let $X_{ij}$ be coefficient of colligation (or maybe its rank, the step number) between every pair of objects ij in one dendrogram, and $Y_{ij}$ likewise be the same in the other dendrogram. Compute correlation or cosine.

$^1$ Later update on the problem of dendrogram of Wards's method. Different clustering programs may output differently transformed aglomeration coefficients for Ward's method. Hence their dendrograms will look somewhat differently despite that the clustering history and results are the same. For example, SPSS doesn't take the root from the ultrametric coefficients, and it cumulates them in the output. Another tradition (found in some R packages, for example) is to take the root (so called "Ward-2" implementations) and not to cumulate. To repeat again, such differences affect only the general shape/looks of the dendrogram, not the clustering results. But the looks of the dendrogram might influence your decision about the number of clusters. The moral is that it would be safe not to rely on dendrogram in Ward's method at all, unless you know exactly what are these coefficients out of your program and how to interpret them correctly.

Solved – Word clustering using different algorithms

Clustering algorithms like KMeans take a matrix where the rows are the observation and the columns are features for that observation. If you have a set of words (w1,w2,...,wn) then you could set up your similarity matrix where the rows represent a specific word and each column is the similarity to each other word. For Example:

      w1 |  w2  ... wn
--------------------
w1 | 1.0 | .12  |  .03
--------------------
w2 | .12 | 1.0  |  .32
--------------------
...|     |      |
--------------------
w3 | .02 | .32  |  1.0

Where any row,column value would represent the the similarity between the word along the row with the word along the column.

This would be your X matrix which you could feed to a clustering algorithm like KMeans (scikit-learn page)

Best Answer

Related Solutions

Solved – Comparing hierarchical clustering dendrograms obtained by different distances & methods

Solved – Word clustering using different algorithms

Related Question