Scenario
I have a project about fraud detection where I need to find outliers using k-means.
- I have a dataset of bank credits with 1,000 rows.
- There are 21 columns (14 categorical, 7 numeric).
Issue
I want to find outliers by clustering the data, and I need all outliers to end up in the same cluster. How can I achieve this in R?
My attempts
I have tried lofactor, but the categorical columns caused an error. After I deleted the categorical columns, it worked.
Results
But I shouldn't delete the categorical columns, since they are also important for determining outliers. So how can I find outlier patterns in R?
Best Answer
Let's first look at a standard definition of outliers in fraud detection (paraphrased from Han et al., Data Mining, 2012): when plotting a numeric variable, those points that deviate from the Gaussian distribution are your outliers (you could use e.g. a Q-Q plot, standard scores, or other methods).
Proceeding.
Using unsupervised learning, e.g. clustering, does indeed, as mkemp6 noted, form valid subgroups of distinctly different data points, and there is no reason not to declare such a subgroup a group of outliers.
You are, however, facing three problems when working with mixed-type variables:
Problem #3 is the textbook problem of applying unsupervised learning methods to outlier detection, but that is something you will have to live with.
Depending on the number of levels per categorical variable, your problem is more or less computationally intensive. Without being an expert, and without having thought all of these through, here are some ideas.
Option 1
Semi-fast and easy
This option is a special case of clustering, for which you will not need a formal algorithm. You could find all unique combinations of the categorical variables (unique(data[, your_categorical_variables])), which gives you the maximum number of possible unique clusters. From the distribution of the numeric variables associated with each combination, you can then identify those data points that do not fit the underlying (probably Gaussian) distribution. However, considering the size of your data set, there will probably not be much repetition, i.e. per unique combination of categorical variables I suspect there will be only a few data points, which would make this approach obsolete.
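A minimal sketch of Option 1; data, and all column names below, are placeholders for your own data set:

```r
# Sketch of Option 1: enumerate the unique categorical combinations,
# then flag numeric values that deviate strongly within each combination.
# 'data', 'purpose', 'job' and 'amount' are placeholder names.
cat_vars <- c("purpose", "job")       # your 14 categorical columns
num_var  <- "amount"                  # one of your 7 numeric columns

# maximum number of candidate clusters
combos <- unique(data[, cat_vars])
nrow(combos)

# group each row by its categorical combination
grp <- interaction(data[, cat_vars], drop = TRUE)

# within each combination, standard scores on the numeric variable;
# |z| > 3 is a rough Gaussian outlier flag (combinations with only one
# row yield NA here -- exactly the sparsity problem noted above)
z <- ave(data[[num_var]], grp, FUN = function(x) as.numeric(scale(x)))
data[which(abs(z) > 3), ]
```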
Option 2
Not so fast, not so easy
This actually uses clustering. You pick a hierarchical k-prototypes algorithm. Since you can hardly make a graphical observation, you can either use your judgement from Option 1 to "guess" the clusters, though for outlier detection this might be unsuitable, or, better, use an F-test as your stopping criterion. The F-test essentially tells you whether, based on the sum of squared deviations in your clusters, a division into $k+1$ clusters is statistically significantly better than a division into $k$ clusters. After your clustering is finished, you proceed as in Option 1: identify the rare categorical variable combinations and look at their numeric distributions to detect your outliers.
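A sketch of this option, assuming the clustMixType package's kproto() function (any k-prototypes implementation will do), on toy data standing in for your 1,000-row credit set:

```r
# Option 2 sketch with kproto() from the 'clustMixType' package
# (an assumption on my part); install.packages("clustMixType")
library(clustMixType)

# toy stand-in for the mixed-type credit data
set.seed(1)
df <- data.frame(
  amount   = c(rnorm(95, 3000, 500), rnorm(5, 15000, 1000)),
  duration = c(rnorm(95, 24, 6), rnorm(5, 72, 6)),
  purpose  = factor(sample(c("car", "tv", "house"), 100, replace = TRUE)),
  job      = factor(sample(c("skilled", "unskilled"), 100, replace = TRUE))
)

kp <- kproto(df, k = 3)

# cluster sizes: a very small cluster is an outlier candidate
table(kp$cluster)

# inspect the rows of the smallest cluster, categorical columns included
smallest <- as.integer(names(which.min(table(kp$cluster))))
df[kp$cluster == smallest, ]
```

You would run this for increasing k and apply the F-test below at each step to decide when to stop.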
Option 3
Semi-fast, easy
You select only your numeric variables and plot their distributions. By graphical inspection you note down possible cluster centers per variable. You then run a standard k-means algorithm (e.g. kmeans() from base R's stats package), passing the anticipated cluster centers as starting points. You use the resulting index list on your complete data set (incl. the categorical data) and determine the rare combinations of categorical variables per cluster. I would particularly investigate very small clusters in your case. You will have to make some assumptions when declaring any cluster a cluster of outliers.
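This workflow can be sketched in base R as follows; data, the column names, and the center guesses are all placeholders:

```r
# Option 3 sketch: k-means on standardised numeric columns with
# hand-picked starting centers, then rare categorical combinations per
# cluster. 'data' and all column names are placeholders.
num_data <- scale(data[, c("amount", "duration")])

# centers guessed from the plotted distributions, one row per cluster;
# the last, far-out center is meant to attract potential outliers
centers <- rbind(c(-0.5, -0.5),
                 c( 0.5,  0.5),
                 c( 3.0,  3.0))

km <- kmeans(num_data, centers = centers)

# attach cluster labels to the full data set, categorical columns included
data$cluster <- km$cluster
table(data$cluster)   # very small clusters are suspicious
table(data$cluster, interaction(data$purpose, data$job, drop = TRUE))
```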
Notes on k-prototypes
The k-prototypes algorithm defines cluster centers as mixtures of numeric and categorical data points. It follows the standard procedure for hierarchical clustering.
Depending on the level of sophistication you want in your result, you may also want to look into genetic algorithms, so as not to fall into local optima here; I believe this is a case for which that might make particular sense.
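For reference, the mixed-type distance at the heart of k-prototypes (after Huang, 1998) is the squared Euclidean distance on the numeric part plus a weight gamma times the number of categorical mismatches; a minimal sketch:

```r
# k-prototypes distance between a data point and a cluster prototype:
# squared Euclidean distance on the numeric part plus gamma times the
# number of mismatched categories; gamma weights the two parts and has
# to be chosen (or tuned) by you.
kproto_dist <- function(x_num, proto_num, x_cat, proto_cat, gamma = 1) {
  sum((x_num - proto_num)^2) + gamma * sum(x_cat != proto_cat)
}

kproto_dist(x_num = c(1.0, 2.0), proto_num = c(1.5, 2.0),
            x_cat = c("car", "skilled"), proto_cat = c("tv", "skilled"))
# 0.25 from the numeric part + 1 mismatch * gamma = 1.25
```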
Notes on the F-Test
This is taken from Cluster Analysis, Everitt et al., 2011. To test whether a clustering into $k+1$ clusters is better than a clustering into $k$ clusters, you calculate the F-statistic in the following manner:
$$ F_{g_1,g_2}=\frac{(S_{g_1}-S_{g_2})/S_{g_2}}{\frac{n-g_1}{n-g_2}\left(\frac{g_2}{g_1}\right)^{2/p}-1} $$
where
$g_1$ = $k$, the number of clusters in the coarser division
$g_2$ = $k+1$, the number of clusters in the finer division
$n$ = the number of data objects
$p$ = the number of variables
$S_{g_1}$ = the sum over all clusters of the squared deviations from the cluster centers in the division into $k$ clusters
$S_{g_2}$ = the same sum for the division into $k+1$ clusters
Your division of the $n$ objects into $g_2$ clusters is significantly better if the F-statistic exceeds the critical value of an F-distribution with $p(g_2-g_1)$ and $p(n-g_2)$ degrees of freedom.
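Translated into R, the statistic and the comparison against the critical value (via qf() from base stats) can be sketched as follows; the values in the example call are made up:

```r
# F-statistic from the formula above: S1, S2 are the total
# within-cluster sums of squared deviations for g1 = k and g2 = k + 1
# clusters; n = number of objects, p = number of variables.
cluster_F <- function(S1, S2, g1, g2, n, p) {
  ((S1 - S2) / S2) /
    (((n - g1) / (n - g2)) * (g2 / g1)^(2 / p) - 1)
}

# made-up example values: does splitting 3 clusters into 4 help?
F_stat <- cluster_F(S1 = 120, S2 = 80, g1 = 3, g2 = 4, n = 1000, p = 7)

# compare against the F critical value with p*(g2 - g1) and p*(n - g2)
# degrees of freedom at the 5% level
F_stat > qf(0.95, df1 = 7 * (4 - 3), df2 = 7 * (1000 - 4))
```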