Solved – How to select features for clustering and, once the model is generated, interpret the results

clustering, machine learning

I am using open data to cluster all manufacturers within a certain region. The attributes include number of locations, number of employees, annual revenue, number of directors, whether the firm is locally owned, whether it is female owned, whether it does business outside manufacturing, which manufacturing sub-categories it is involved in, whether it belongs to an association or chamber of commerce, and whether it is registered locally, regionally, provincially or nationally.
I am also using the manufacturers' postal codes to merge in census data.
This generates more than 200 attributes, and I am trying to use k-means to build a clustering model.
I transformed all necessary attributes to numeric and used a "remove correlated attributes" function to choose the attributes for the model.
There are 5 clusters. The problem now is how to interpret the manufacturers within each cluster. I tried using classification to find variable importance to make sense of the clusters, but no obvious explanation emerged.

Has anyone done a similar project, or am I heading in the wrong direction?

Best Answer

The first thing I would do is separate features that could be useful for classification from features that are purely descriptive. The goal is to find a select set of features that puts the manufacturers on a comparable basis. This can be done judgmentally or, if one or more target variables can be defined, with a model. One way to partition the features is to ask which are "controllable" or manipulable by a manufacturer and which are "structural" or fixed. For the purposes of classification, structural factors are best. For instance, the brief list of features that you mention could be used for classification, with two exceptions. First, and to @anony-mousse's point, postal code would not be a good classifier and is probably not even a good descriptor, as it is so granular. The same goes for manufacturing category, unless it is like an SIC code, which can be aggregated up to broad groupings such as Division A (Agriculture) or B (Mining), etc., and, within Division D (Manufacturing), the Major Groups.
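As a rough illustration of rolling a detailed industry code up to broad groupings, here is a minimal sketch. It assumes each manufacturer carries a 4-digit SIC code stored as a string, which the question does not confirm. The names `SIC_DIVISIONS`, `sic_major_group` and `sic_division` are made up for this example, and the lookup table is deliberately partial: it covers only the divisions named above and lumps everything else into "Other".

```python
# Major-group (first two digits) ranges for the SIC divisions named above.
# Deliberately partial: every other division falls through to "Other".
SIC_DIVISIONS = [
    (1, 9, "A: Agriculture, Forestry, and Fishing"),
    (10, 14, "B: Mining"),
    (20, 39, "D: Manufacturing"),
]


def sic_major_group(sic_code):
    """Return the 2-digit Major Group of a SIC code, kept as a string."""
    return str(sic_code).zfill(4)[:2]


def sic_division(sic_code):
    """Map a SIC code to a broad division label."""
    major_group = int(sic_major_group(sic_code))
    for low, high, label in SIC_DIVISIONS:
        if low <= major_group <= high:
            return label
    return "Other"


# Hypothetical codes, stored as strings so leading zeros survive.
codes = ["2834", "3711", "1311", "0111", "5812"]
divisions = [sic_division(c) for c in codes]
major_groups = [sic_major_group(c) for c in codes]
```

Since every firm in this data set is a manufacturer, the division level alone would not separate them; the 2-digit Major Group is probably the more useful level here.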

Be careful about converting everything to numerics. This is dangerous because, e.g., postal codes in the US are 5- or 9-digit numbers that should always be treated as categorical values. If the feature is truly categorical and doesn't have too many levels, consider converting it to a set of 0/1 dummy variables. Postal code is an example of a feature with way too many levels, unless you are comfortable with some of the workarounds for massively categorical information that are out there. Consider rolling it up to the state level, or even higher to the zip code region level.
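A minimal sketch of both ideas follows: keep postal codes as strings, roll them up to a coarse prefix, then dummy-code the result. It assumes US-style ZIP codes, where the first digit designates a broad region; for another country's postal system, `prefix_len` would need a different value. The function names and the `max_levels` cutoff are illustrative choices, not a standard.

```python
def rollup_postal(code, prefix_len=1):
    """Coarsen a postal code to its leading characters, kept as a string."""
    return str(code).strip().upper()[:prefix_len]


def one_hot(values, max_levels=10):
    """Turn a categorical column into one 0/1 dummy column per level.

    Raises if there are more levels than max_levels, which is the
    "too many levels" case where the feature should be rolled up first.
    """
    levels = sorted(set(values))
    if len(levels) > max_levels:
        raise ValueError(f"{len(levels)} levels; roll the feature up first")
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]


# Hypothetical records; the codes are strings, never numbers.
postal = ["02139", "02142", "10001", "94105", "94107"]
regions = [rollup_postal(p) for p in postal]  # first digit of the ZIP code
dummies = one_hot(regions)
```

Each row of `dummies` has exactly one 1. If you feed the dummies to a regression-type model, drop one level to avoid perfect collinearity.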

Integrating census data at the postal code level is a good idea insofar as it contributes information about the market conditions where the manufacturer is doing business. This would facilitate grouping locations by the similarity of their markets. This is not such a good idea if the manufacturers' customers or client base is largely outside of the specific location where the products are being manufactured, as is the case with online sales or widely distributed brick and mortar stores. In addition and assuming the analysis is in the USA, the Census Bureau, the Bureau of Labor Statistics, the Bureau of Economic Analysis, Statistical Abstracts, as well as many other federal agencies have reports that detail many factors about markets and industries that could be mined to further enhance the data. Many, if not most, of these can be accessed online.

Once the raw data issues are handled, there is the question of how to classify these manufacturers. Again to @anony-mousse's point, with such a mixture of information and scale types, k-means is going to give crap, so forget about it. Broadly speaking, the approach can be described as peer grouping, where the options are supervised vs unsupervised methods.

This is where methodologists can disagree, mostly as a function of their training. My preference is to use unsupervised approaches rooted in scale-invariant finite mixture models such as latent class models. For additional background on these models, see the Statistical Innovations Latent Gold software website for many papers and suggestions about building them; you don't have to buy their software for access. There is a free R package out there, poLCA, but it is for the limiting case where the features are all categorical. I'm not aware of freeware that does true mixture models. Perhaps others on this thread can suggest something.

Of course there are always decision trees, SVMs and a boatload of other machine learning algorithms that are supervised learning methods, assuming a target is defined. Kohonen self-organizing maps are one approach for unsupervised learning.

Just make sure that, whatever algorithm is used, it can handle a mixture of continuous and discrete information.
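To make "handling a mixture" concrete, here is a minimal sketch of Gower dissimilarity, one common way to compare records that mix continuous and categorical features. This is a technique I am swapping in for illustration, not the latent class approach recommended above. The feature names and values are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower dissimilarity between two records with mixed feature types.

    a, b: dicts mapping feature name -> value.
    numeric_ranges: dict mapping each numeric feature to its (max - min)
    over the whole data set; any feature not listed is treated as categorical.
    """
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            rng = numeric_ranges[name]
            # Range-scaled absolute difference, so the term lies in [0, 1].
            total += abs(a[name] - b[name]) / rng if rng else 0.0
        else:
            # Simple matching for categorical features: 0 if equal, else 1.
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)


# Hypothetical manufacturers.
firms = [
    {"employees": 10, "revenue": 1.0, "locally_owned": "yes", "scope": "local"},
    {"employees": 510, "revenue": 6.0, "locally_owned": "no", "scope": "national"},
    {"employees": 10, "revenue": 1.0, "locally_owned": "yes", "scope": "regional"},
]
ranges = {
    name: max(f[name] for f in firms) - min(f[name] for f in firms)
    for name in ("employees", "revenue")
}
d01 = gower_distance(firms[0], firms[1], ranges)  # differ on every feature
d02 = gower_distance(firms[0], firms[2], ranges)  # differ on one of four
```

A matrix of such dissimilarities can be passed to methods that accept precomputed distances, such as hierarchical clustering or k-medoids. Plain k-means cannot use it, because it needs cluster means.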

Once the manufacturers are partitioned, use the classifying as well as the descriptive information to "interpret" the clusters. Since there won't be a "ground truth" against which to validate the solution, check it for internal statistical consistency and stability: run it multiple times with varying inputs to stress test the results, and use k-fold cross-validation to compare how much entities move around.
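One simple way to quantify "how much entities move around" between two runs is the Rand index: the share of entity pairs that both partitions treat the same way (together in both, or apart in both). The sketch below assumes each run yields one cluster label per manufacturer, in the same order. The label lists are made up.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of entity pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same entities")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster labels from three runs over the same six firms.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same grouping, cluster numbers permuted
run_3 = [0, 1, 1, 1, 2, 2]  # the second firm moved to another cluster

same_grouping = rand_index(run_1, run_2)
one_moved = rand_index(run_1, run_3)
```

Because it compares pairs rather than label values, the index ignores arbitrary cluster numbering, so `run_1` and `run_2` count as identical. Values well below 1 across resamples or input variations suggest an unstable solution. Chance-corrected variants such as the adjusted Rand index are usually preferred for real comparisons.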

And so on...
