Solved – K-Means Clustering Not Working As Expected

clustering, k-means, machine learning, python, scikit-learn

I have a script that I'm testing in Python 3 with scikit-learn to cluster terms based on either word or character n-grams. Basically, it's fed a list of training data with corresponding labels. For example:

Name            Label
mexican food    1
greek cuisine   1
hotel night     7
...
airfare         7

After I run the program I type in raw input, which should be transformed and predicted. However, no matter what I put in, the program makes the same prediction. This happens even with a term such as 'mexican', which appears only once in the training data and hence should be trivial to predict. Can anyone spot the issue?

from __future__ import print_function

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics

from sklearn.cluster import KMeans, MiniBatchKMeans

import logging
from optparse import OptionParser
import sys
from time import time

import numpy as np


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
              dest="n_components", type="int",
              help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
              action="store_false", dest="minibatch", default=True,
              help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--no-idf",
              action="store_false", dest="use_idf", default=True,
              help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--analyzer",
              type='str', default='word',
              help="Which analyzer to use. Valid options are 'word' and 'char'")
op.add_option("--use-hashing",
              action="store_true", default=False,
              help="Use a hashing feature vectorizer")
op.add_option("--n-features", type=int, default=10000,
              help="Maximum number of features (dimensions)"
                   " to extract from text.")
op.add_option("--verbose",
              action="store_true", dest="verbose", default=False,
              help="Print progress reports inside k-means algorithm.")

print(__doc__)
op.print_help()

(opts, args) = op.parse_args()
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)

opts.analyzer = opts.analyzer.lower()
assert opts.analyzer in ['word','char']

###############################################################################
# Read in the data
inputfile = '../data/dodcategories.csv'
data = np.loadtxt(inputfile,dtype=[('type','|S16'),('subID',np.int),('ID',np.int)],delimiter='\t',skiprows=0,unpack=True)
X = np.array([str(item,'utf-8').lower() for item in data[0]])
labels = np.array(data[1])
true_k = np.unique(labels).shape[0]


print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    if opts.use_idf:
        # Perform an IDF normalization on the output of HashingVectorizer
        hasher = HashingVectorizer(n_features=opts.n_features,
                                   stop_words='english', non_negative=True,
                                   norm=None, ngram_range=(1, 10), binary=False, analyzer=opts.analyzer)
        vectorizer = make_pipeline(hasher, TfidfTransformer())
    else:
        vectorizer = HashingVectorizer(n_features=opts.n_features,
                                       stop_words='english',
                                       non_negative=False, norm='l2',
                                       binary=False, ngram_range=(1, 10), analyzer=opts.analyzer)
else:
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=opts.use_idf, ngram_range=(1, 10),analyzer=opts.analyzer)
X = vectorizer.fit_transform(X)

print('------------------------------------------------')
print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

if opts.n_components:
    print("Performing dimensionality reduction using LSA")
    t0 = time()
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    svd = TruncatedSVD(opts.n_components)
    lsa = make_pipeline(svd, Normalizer(copy=False))

    X = lsa.fit_transform(X)

    print("done in %fs" % (time() - t0))

    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))

    print()


###############################################################################
# Do the actual clustering

if opts.minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000, verbose=opts.verbose)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                verbose=opts.verbose)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X,labels)
print("done in %0.3fs" % (time() - t0))
print()

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels, sample_size=1000))

print()

if not (opts.n_components or opts.use_hashing):
    #print("Top terms per cluster:")
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names()
    #for i in range(true_k):
        #print("Cluster %d:" % i, end='')
        #for ind in order_centroids[i, :10]:
            #print(' %s' % terms[ind], end='')
        #print()

test = 'test'
while test.lower() not in ['exit','',None]:
    test = input("Enter a category (Type exit to quit): ")
    X_test = [test.lower()]
    print("Test: {}".format(X_test))
    X_test = vectorizer.transform(X_test)
    print("Test: {}".format(X_test))
    result = km.predict(X_test)
    print("Result: {}".format(result))

Best Answer

k-means:

  • does not work well for high-dimensional data
  • is sensitive to noise (and text is very noisy)
  • is not a classification algorithm

I'm not surprised it doesn't work, because you are looking for a classifier, not a clustering algorithm.
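
For what it's worth, here is a minimal sketch of what a supervised approach could look like on data shaped like yours. The inline names/labels lists stand in for your CSV columns, and TfidfVectorizer plus LogisticRegression is just one reasonable combination, not the only choice:

# Rough sketch of a supervised alternative: train a classifier on the labelled
# names instead of clustering them, then predict the label of new input.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

names = ["mexican food", "greek cuisine", "hotel night", "airfare"]  # stand-in for your Name column
labels = [1, 1, 7, 7]                                                # stand-in for your Label column

clf = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4), lowercase=True),
    LogisticRegression(max_iter=1000),
)
clf.fit(names, labels)

print(clf.predict(["mexican"]))  # uses the labels you already have instead of ignoring them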

Try looking at the frequency of your clusters. I wouldn't be surprised if almost everything ends up in the same megacluster, and the other "clusters" are wasted on some outliers. That would be the typical (useless) result of k-means on such data sets. Sorry, k-means is not a magic bullet.
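
Concretely, the frequency check is a couple of lines once the model is fitted; this sketch assumes the km and labels variables from the script above:

# Count how many samples landed in each cluster (uses the fitted `km`
# and the true `labels` from the script above).
import numpy as np

cluster_ids, counts = np.unique(km.labels_, return_counts=True)
for cid, count in zip(cluster_ids, counts):
    print("cluster %d: %d samples" % (cid, count))

# Compare against the distribution of the true labels for reference.
print(np.unique(labels, return_counts=True))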
