Solved – Outlier detection with data (which has categorical and numeric variables) with R

Tags: k-means, outliers, r

Scenario

I have a fraud-detection project in which I need to find outliers using k-means.

  1. I have a dataset of 1,000 bank credit records.

  2. There are 21 columns (14 categorical, 7 numeric).

Issue

I want to find outliers by clustering the data, and I need all outliers to end up in the same cluster. How can I achieve this with R?

My attempts

I have tried "lofactor", but the categorical columns caused an error.

After I deleted the categorical columns, it worked.

Results

But I shouldn't delete the categorical columns, since they are also important for determining outliers.

So how can I find the outlier pattern in R?

Best Answer

Let's first look at a standard definition of outliers in fraud detection (paraphrased from Han et al., Data Mining, 2012):

A customer generates transactions that follow roughly a Gaussian distribution; consider, e.g., buying a bigger lunch one day and a smaller one the next. An outlier is a data object that deviates significantly from the rest of the objects, as if it followed a different distribution.

That is, when plotting a numeric variable, the points that deviate from your Gaussian distribution are your outliers (you could use, e.g., a Q-Q plot, standard scores, or other methods).
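For instance, a standard-score check on a single numeric variable might look like this in base R (synthetic data; the 3-standard-deviation cutoff is a common rule of thumb, not a fixed rule):

```r
# Synthetic transaction amounts: mostly Gaussian, plus two injected outliers
set.seed(42)
amounts <- c(rnorm(100, mean = 50, sd = 5), 200, -40)

# Standard scores: how many standard deviations each point lies from the mean
z <- (amounts - mean(amounts)) / sd(amounts)

# Flag points more than 3 standard deviations away
outliers <- which(abs(z) > 3)
amounts[outliers]   # the two injected points
```

Note that the outliers themselves inflate the mean and standard deviation; for heavily contaminated data, robust alternatives (median and MAD) behave better.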


Proceeding.

Using unsupervised learning, e.g. clustering, does indeed form, as mkemp6 noticed, valid subgroups of distinctly different data points, and there is no reason not to declare such a subgroup a group of outliers.

You do, however, face three problems when working with mixed-type variables:

  1. If you intend to use non-hierarchical clustering, which I'd suggest, you will need to determine how many clusters (denoted by $k$) your data should be grouped into.
  2. You will end up with a combination of numeric and non-numeric distributions; for the latter you will need to define what an outlier is.
  3. Given this definition of outliers, the outliers themselves might follow very different distributions, hence clustering them into one cluster might be tricky.

Problem #3 is the textbook problem of applying unsupervised learning methods to outlier detection, and it is the one you will have to live with.

Depending on the number of levels per categorical variable, your problem is more or less computationally intensive. Without being an expert, and without having thought all of these through completely, here are some ideas.

Option 1

Semi-fast and easy

This option is a special case of clustering for which you will not need a formal algorithm. You could find all unique combinations of categorical variables (unique(data[,your_categorical_variables])), which gives you the maximum number of possible unique clusters. Within each combination, you then observe the distribution of the associated numeric variables and identify those data points that do not fit the underlying (probably Gaussian) distribution. However, considering the size of your data set, there will probably not be much repetition, i.e. per unique combination of categorical variables I suspect there will be only a few data points, which would make this approach useless.
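A sketch of this idea in base R (synthetic data; the column names and the minimum group size of 10 are assumptions for illustration):

```r
# Sketch of Option 1: one candidate "cluster" per unique combination of
# categorical values, then a per-group standard-score check on the numerics.
set.seed(1)
data <- data.frame(
  purpose = sample(c("car", "business"), 200, replace = TRUE),
  housing = sample(c("own", "rent"), 200, replace = TRUE),
  amount  = rnorm(200, mean = 3000, sd = 500)
)
data$amount[1] <- 10000   # inject one clear outlier

cat_vars <- c("purpose", "housing")
grp <- interaction(data[, cat_vars], drop = TRUE)
nlevels(grp)   # maximum number of possible unique clusters

# Within each combination, flag values more than 3 SDs from the group mean;
# combinations with too few members cannot support this test
flag <- ave(data$amount, grp, FUN = function(x) {
  if (length(x) < 10 || sd(x) == 0) return(numeric(length(x)))
  as.numeric(abs(x - mean(x)) / sd(x) > 3)
})
data$outlier <- flag == 1
```

With only 1000 rows and 14 categorical columns, most combinations will fall below any reasonable minimum group size, which is exactly the limitation described above.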

Option 2

Not so fast, not so easy

This actually uses clustering: you pick a (non-hierarchical) k-prototypes algorithm. Since you can hardly make a graphical observation, you can either use your judgement from Option 1 to "guess" the number of clusters, though for outlier detection this might be unsuitable, or, better, use an F-test as your stopping criterion. The F-test essentially tells you whether, based on the sum of squared deviations within your clusters, a division into $k+1$ clusters is statistically significantly better than a division into $k$ clusters. After the clustering is finished, you proceed as in Option 1: identify rare combinations of categorical variables and look at their numeric distributions to detect your outliers.

Option 3

Semi-fast, easy

You select only your numeric variables and plot their distributions. By graphical inspection you note down possible cluster centers per variable. You then use the standard k-means algorithm (kmeans in the stats package) and pass the anticipated cluster centers as starting points. You apply the resulting cluster index to your complete data set (incl. categorical data) and determine the rare combinations of categorical variables per cluster. I would particularly investigate very small clusters in your case. You will still have to make some assumptions when declaring any cluster a cluster of outliers.
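A sketch in base R (synthetic data; the guessed centers and the small-cluster threshold are illustrative assumptions):

```r
# Sketch of Option 3: k-means on the numeric variables only, with
# eyeballed starting centers, then inspect the small clusters.
set.seed(7)
num <- cbind(
  amount   = c(rnorm(95, 3000, 300), rnorm(5, 9000, 300)),
  duration = c(rnorm(95, 24, 4),     rnorm(5, 60, 4))
)
num_scaled <- scale(num)   # put variables on comparable scales

# Guessed centers from inspecting the (scaled) distributions:
# one bulk cluster near the origin, one far-out cluster
centers <- rbind(c(0, 0), c(3, 3))
km <- kmeans(num_scaled, centers = centers)

table(km$cluster)                    # cluster sizes
small <- which(table(km$cluster) < 10)   # outlier-candidate clusters
```

The cluster indices in km$cluster can then be joined back onto the full data frame, including the categorical columns, to look for rare categorical combinations per cluster.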

Notes on k-prototypes

The k-prototypes algorithm defines cluster centers as mixtures of numeric and categorical data points. It follows the standard procedure for non-hierarchical (k-means-style) clustering:

  1. Pick $k$ random cluster centers from your data
  2. Compute the distance of one point to each cluster center. The distance function typically contains a Euclidean part for the numeric variables and a 0-1 matching part for the categorical variables (see the linked paper). More elaborate distance functions for the categorical variables might be needed.
  3. Adjust the cluster centers according to the mean and mode of all points currently assigned to them
  4. Repeat 2-3 for every data point
  5. Iterate 2-4 until the cluster centers no longer change
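The steps above can be sketched in base R as follows (a toy implementation with batch updates rather than the point-by-point updates described; `lambda`, which weighs the categorical mismatches against the numeric distances, is an assumed tuning parameter):

```r
# Minimal k-prototypes sketch in base R (illustration, not production code).
# num: numeric matrix; cat: character matrix; one row per data object.
kprototypes <- function(num, cat, k, lambda = 1, iter_max = 20) {
  n <- nrow(num)
  idx <- sample(n, k)                          # 1. k random centers from the data
  cen_num <- num[idx, , drop = FALSE]
  cen_cat <- cat[idx, , drop = FALSE]
  cluster <- integer(n)
  mode_of <- function(x) names(which.max(table(x)))
  for (iter in seq_len(iter_max)) {
    # 2. distance to each center: squared Euclidean for the numeric part
    #    plus lambda times the number of categorical mismatches
    d <- sapply(seq_len(k), function(j) {
      d_num <- rowSums(sweep(num, 2, cen_num[j, ])^2)
      d_cat <- rowSums(cat != matrix(cen_cat[j, ], n, ncol(cat), byrow = TRUE))
      d_num + lambda * d_cat
    })
    new_cluster <- max.col(-d)                 # nearest center per point
    if (identical(new_cluster, cluster)) break # 5. stop when assignments stabilize
    cluster <- new_cluster
    # 3./4. move each center to the mean/mode of its current members
    for (j in seq_len(k)) {
      members <- cluster == j
      if (!any(members)) next
      cen_num[j, ] <- colMeans(num[members, , drop = FALSE])
      cen_cat[j, ] <- apply(cat[members, , drop = FALSE], 2, mode_of)
    }
  }
  list(cluster = cluster, centers_num = cen_num, centers_cat = cen_cat)
}

# Toy data with two obvious mixed-type groups
set.seed(123)
num <- matrix(c(rnorm(20, 0), rnorm(20, 10)), ncol = 1)
cat <- matrix(c(rep("a", 20), rep("b", 20)), ncol = 1)
res <- kprototypes(num, cat, k = 2)
```

For real use you would scale the numeric variables first and restart from several random initializations, since (like k-means) this only converges to a local optimum.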

Depending on the level of sophistication required of your result, you may also want to look into genetic algorithms to avoid falling into local optima; I believe this is a case for which that might make particular sense.

Notes on the F-Test

This is taken from Cluster Analysis, Everitt et al., 2011. To compare if a clustering into $k+1$ clusters is better than a clustering into $k$ clusters, you calculate the F-statistic in the following manner:

$$ F_{g_1,g_2}=\frac{(S_{g_1}^2-S_{g_2}^2)/S_{g_2}^2}{\frac{n-g_1}{n-g_2}\left(\frac{g_2}{g_1}\right)^{2/p}-1} $$

where

$g_1$ = $k$, the smaller number of clusters

$g_2$ = $k+1$, the larger number of clusters

$n$ = number of data objects

$p$ = number of variables

$S_{g_1}^2$ = the sum of the sums of squared deviations from the cluster centers in the division into $k$ clusters

$S_{g_2}^2$ = the same quantity for the division into $k+1$ clusters

Your division of the $n$ objects into $g_2$ clusters is significantly better if the F-statistic exceeds the critical value of an F-distribution with $p(g_2-g_1)$ and $p(n-g_2)$ degrees of freedom.
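The test can be computed directly in base R (a sketch; `S1` and `S2` denote the total within-cluster sums of squared deviations for $k$ and $k+1$ clusters, and the function and variable names are my own):

```r
# F-test for comparing a division into g1 = k clusters against g2 = k+1
# clusters, following the formula above.
cluster_f_test <- function(S1, S2, n, p, g1, g2, alpha = 0.05) {
  f_stat <- ((S1 - S2) / S2) /
    (((n - g1) / (n - g2)) * (g2 / g1)^(2 / p) - 1)
  df1 <- p * (g2 - g1)          # numerator degrees of freedom
  df2 <- p * (n - g2)           # denominator degrees of freedom
  crit <- qf(1 - alpha, df1, df2)
  list(f_stat = f_stat, critical = crit, significant = f_stat > crit)
}

# Example: is k = 2 significantly better than k = 1 on clearly bimodal data?
set.seed(5)
x  <- matrix(c(rnorm(30, 0), rnorm(30, 12)), ncol = 1)
S1 <- kmeans(x, centers = 1)$tot.withinss
S2 <- kmeans(x, centers = 2)$tot.withinss
res <- cluster_f_test(S1, S2, n = nrow(x), p = ncol(x), g1 = 1, g2 = 2)
```

You would apply this repeatedly, increasing $k$ by one until the test no longer rejects, and use that $k$ as the stopping point for the clustering in Option 2.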