Solved – outliers in text document clustering

outliers, text mining

I am using k-means for text categorization. I have some predefined labels (categories) that I want the unlabeled documents to be clustered into. Some documents don't fit any of the labels.
For example, I have a dataset of sports documents. My task is to cluster them into football, volleyball, tennis, … categories/clusters. As I said, some of the documents are not about any of these labels. Treating them as outliers, I want to remove them from the clusters.
What's the easiest way to detect them?

I've seen some methods so far, such as:

  • mean plus/minus two standard deviations
    Is this method a good choice for "text document" outlier detection?
    I don't know how (or whether it's possible) to use this method for documents represented as feature vectors.

  • proximity-based models (similarity of a document to the centroid of its cluster)
    In this case, what's a good threshold for identifying a document as an outlier? One source (linked here in the original post) claims that a threshold of 0.4 gave the best performance for text classification using kNN.

Best Answer

First, the usual term for what you are describing is 'classification'. Clustering is a form of unsupervised modelling, where no class labels are known. So your task is to classify the unlabelled documents into one of a set of known classes, or into an "unknown" class for outliers.

Second, you have many options for "outlier detection". One is as you suggest: classify the documents and define as an outlier anything that is distant from the nearest class (e.g. measured in standard deviations). Or, if you use a probabilistic classifier such as naive Bayes, you could define outliers as documents whose most likely class still has a very low likelihood.
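To make the distance-based option concrete, here is a minimal sketch, assuming TF-IDF feature vectors, cosine similarity to each class centroid, and scikit-learn. The toy documents, the helper names `fit_centroids` and `classify`, and the "mean minus two standard deviations" cutoff per class are all illustrative assumptions, not a recommended setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy labelled data, made up for illustration only.
train_docs = [
    "the striker scored a goal in the football match",
    "the goalkeeper saved a penalty in the football final",
    "football fans cheered as the striker scored again",
    "she won the tennis match with a strong serve",
    "the tennis player hit an ace on her first serve",
    "a long rally ended the tennis final at the net",
]
train_labels = ["football"] * 3 + ["tennis"] * 3

def fit_centroids(X, labels):
    """Return class names, one centroid per class, and a similarity cutoff per class."""
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    centroids, cutoffs = [], []
    for c in classes:
        rows = X[labels == c]
        centroid = rows.mean(axis=0, keepdims=True)
        sims = cosine_similarity(rows, centroid).ravel()
        centroids.append(centroid.ravel())
        # Assumed rule: flag anything more than two standard deviations
        # below the mean similarity of the class's own training documents.
        cutoffs.append(sims.mean() - 2 * sims.std())
    return classes, np.vstack(centroids), np.array(cutoffs)

def classify(X_new, classes, centroids, cutoffs):
    """Assign each row to its most similar centroid, or to 'unknown'."""
    sims = cosine_similarity(X_new, centroids)
    best = sims.argmax(axis=1)
    best_sim = sims[np.arange(len(best)), best]
    return [classes[i] if s >= cutoffs[i] else "unknown"
            for i, s in zip(best, best_sim)]

vectorizer = TfidfVectorizer(stop_words="english")
# Dense arrays keep the sketch short; a large corpus would stay sparse.
X_train = vectorizer.fit_transform(train_docs).toarray()
classes, centroids, cutoffs = fit_centroids(X_train, train_labels)

new_docs = [
    "the striker scored a late goal",
    "stock markets fell sharply after the interest rate announcement",
]
predictions = classify(vectorizer.transform(new_docs).toarray(),
                       classes, centroids, cutoffs)
print(list(zip(new_docs, predictions)))
```

With only a handful of training documents per class the standard deviation is a very rough estimate, so in practice the cutoff would need to be checked on held-out data.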

The best method, and the choice of thresholds, depend very much on the details of your data, so you will need to try out several approaches and see what works best.
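As one way of trying out the probabilistic option, the sketch below fits a naive Bayes classifier with scikit-learn and rejects documents whose best class probability falls below a threshold, then shows how the number of rejected documents changes across a few candidate thresholds. Note the assumptions: it uses the maximum posterior probability from `predict_proba` as a stand-in for the "low likelihood" criterion, the data is a made-up toy set, and `classify_with_reject` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labelled data, made up for illustration only.
train_docs = [
    "the striker scored a goal in the football match",
    "the goalkeeper saved a penalty in the football final",
    "football fans cheered as the striker scored again",
    "she won the tennis match with a strong serve",
    "the tennis player hit an ace on her first serve",
    "a long rally ended the tennis final at the net",
]
train_labels = ["football"] * 3 + ["tennis"] * 3

vectorizer = CountVectorizer(stop_words="english")
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(train_docs), train_labels)

def classify_with_reject(model, X, threshold):
    """Return the most probable class, or 'unknown' if its probability is below threshold."""
    proba = model.predict_proba(X)
    best = proba.argmax(axis=1)
    return [str(model.classes_[i]) if row[i] >= threshold else "unknown"
            for i, row in zip(best, proba)]

new_docs = [
    "the striker scored a goal",
    "a strong serve won the match",
    "stock markets fell sharply after the interest rate announcement",
]
X_new = vectorizer.transform(new_docs)

# Compare a few candidate thresholds; the values are arbitrary starting points.
for threshold in (0.6, 0.8, 0.95):
    labels = classify_with_reject(clf, X_new, threshold)
    print(threshold, labels.count("unknown"), labels)
```

Naive Bayes posteriors tend to be pushed towards 0 or 1, so a threshold that looks strict may still reject very little; comparing the flagged documents by eye at each threshold is a reasonable first check.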
