Solved – Outlier detection with data (which has categorical and numeric variables) with R

Tags: k-means, outliers, r

Scenario

I have a fraud-detection project in which I need to find outliers using k-means.

  1. I have a dataset of 1,000 bank credit records.

  2. There are 21 columns (14 categorical, 7 numeric).

Issue

I want to find outliers by clustering the data, and I need all outliers to end up in the same cluster. How can I achieve this with R?

My attempts

I have tried "lofactor", but the categorical columns caused an error.

After I deleted the categorical columns, it worked.

Results

But I shouldn't delete the categorical columns, since they are also important for determining outliers.

So how can I find the outlier pattern in R?

Best Answer

Let's first look at a standard definition of outliers in fraud detection (paraphrased from Han et al., Data Mining, 2012):

A customer generates transactions that follow roughly a Gaussian distribution; consider, e.g., buying a bigger lunch one day and a smaller one the next. An outlier is a data object that deviates significantly from the rest of the objects, as if it followed a different distribution.

That is, when plotting a numeric variable, the points that deviate from your Gaussian distribution are your outliers (you could use, e.g., a Q-Q plot, standard scores, or other methods).
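For instance, a standard-score check on a single numeric variable might look like this in base R (synthetic data; the 3-standard-deviation cutoff is a common rule of thumb, not a fixed rule):

```r
# Synthetic transaction amounts: mostly Gaussian, plus two injected outliers
set.seed(42)
amounts <- c(rnorm(100, mean = 50, sd = 5), 200, -40)

# Standard scores: how many standard deviations each point lies from the mean
z <- (amounts - mean(amounts)) / sd(amounts)

# Flag points more than 3 standard deviations away
outliers <- which(abs(z) > 3)
amounts[outliers]   # the two injected points
```

Note that the outliers themselves inflate the mean and standard deviation; for heavily contaminated data, robust alternatives (median and MAD) behave better.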


Proceeding.

Using unsupervised learning, e.g. clustering, does indeed form, as mkemp6 noticed, valid subgroups of distinctly different data points, and there is no reason not to declare such a subgroup a group of outliers.

You do, however, face three problems when working with mixed-type variables:

  1. If you intend to use non-hierarchical clustering, which I'd suggest, you will need to determine how many clusters (denoted by $k$) your data should be grouped into.
  2. You will end up with a combination of numeric and non-numeric distributions; for the latter you will need to define what an outlier is.
  3. Given this definition of outliers, the outliers themselves might follow very different distributions, hence clustering them into one cluster might be tricky.

Problem #3 is the textbook problem of applying unsupervised learning methods to outlier detection, and it is the one you will have to live with.

Depending on the number of levels per categorical variable, your problem is more or less computationally intensive. Without being an expert, and without having thought all of these through completely, here are some ideas.

Option 1

Semi-fast and easy

This option is a special case of clustering for which you will not need a formal algorithm. You could find all unique combinations of categorical variables (unique(data[,your_categorical_variables])), which gives you the maximum number of possible unique clusters. Within each combination, you then observe the distribution of the associated numeric variables and identify those data points that do not fit the underlying (probably Gaussian) distribution. However, considering the size of your data set, there will probably not be much repetition, i.e. per unique combination of categorical variables I suspect there will be only a few data points, which would make this approach useless.
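A sketch of this idea in base R (synthetic data; the column names and the minimum group size of 10 are assumptions for illustration):

```r
# Sketch of Option 1: one candidate "cluster" per unique combination of
# categorical values, then a per-group standard-score check on the numerics.
set.seed(1)
data <- data.frame(
  purpose = sample(c("car", "business"), 200, replace = TRUE),
  housing = sample(c("own", "rent"), 200, replace = TRUE),
  amount  = rnorm(200, mean = 3000, sd = 500)
)
data$amount[1] <- 10000   # inject one clear outlier

cat_vars <- c("purpose", "housing")
grp <- interaction(data[, cat_vars], drop = TRUE)
nlevels(grp)   # maximum number of possible unique clusters

# Within each combination, flag values more than 3 SDs from the group mean;
# combinations with too few members cannot support this test
flag <- ave(data$amount, grp, FUN = function(x) {
  if (length(x) < 10 || sd(x) == 0) return(numeric(length(x)))
  as.numeric(abs(x - mean(x)) / sd(x) > 3)
})
data$outlier <- flag == 1
```

With only 1000 rows and 14 categorical columns, most combinations will fall below any reasonable minimum group size, which is exactly the limitation described above.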

Option 2

Not so fast, not so easy

This actually uses clustering: you pick a (non-hierarchical) k-prototypes algorithm. Since you can hardly make a graphical observation, you can either use your judgement from Option 1 to "guess" the number of clusters, though for outlier detection this might be unsuitable, or, better, use an F-test as your stopping criterion. The F-test essentially tells you whether, based on the sum of squared deviations within your clusters, a division into $k+1$ clusters is statistically significantly better than a division into $k$ clusters. After the clustering is finished, you proceed as in Option 1: identify rare combinations of categorical variables and look at their numeric distributions to detect your outliers.

Option 3

Semi-fast, easy

You select only your numeric variables and plot their distributions. By graphical inspection you note down possible cluster centers per variable. You then use the standard k-means algorithm (kmeans in the stats package) and pass the anticipated cluster centers as starting points. You apply the resulting cluster index to your complete data set (incl. categorical data) and determine the rare combinations of categorical variables per cluster. I would particularly investigate very small clusters in your case. You will still have to make some assumptions when declaring any cluster a cluster of outliers.
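A sketch in base R (synthetic data; the guessed centers and the small-cluster threshold are illustrative assumptions):

```r
# Sketch of Option 3: k-means on the numeric variables only, with
# eyeballed starting centers, then inspect the small clusters.
set.seed(7)
num <- cbind(
  amount   = c(rnorm(95, 3000, 300), rnorm(5, 9000, 300)),
  duration = c(rnorm(95, 24, 4),     rnorm(5, 60, 4))
)
num_scaled <- scale(num)   # put variables on comparable scales

# Guessed centers from inspecting the (scaled) distributions:
# one bulk cluster near the origin, one far-out cluster
centers <- rbind(c(0, 0), c(3, 3))
km <- kmeans(num_scaled, centers = centers)

table(km$cluster)                    # cluster sizes
small <- which(table(km$cluster) < 10)   # outlier-candidate clusters
```

The cluster indices in km$cluster can then be joined back onto the full data frame, including the categorical columns, to look for rare categorical combinations per cluster.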

Notes on k-prototypes

The k-prototypes algorithm defines cluster centers as mixtures of numeric and categorical data points. It follows the standard procedure for non-hierarchical (k-means-style) clustering:

  1. Pick $k$ random cluster centers from your data
  2. Compute the distance of one point to each cluster center. The distance function typically contains a Euclidean part for the numeric variables and a 0-1 matching part for the categorical variables (see the linked paper). More elaborate distance functions for the categorical variables might be needed.
  3. Adjust the cluster centers according to the mean and mode of all points currently assigned to them
  4. Repeat 2-3 for every data point
  5. Iterate 2-4 until the cluster centers no longer change
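The steps above can be sketched in base R as follows (a toy implementation with batch updates rather than the point-by-point updates described; `lambda`, which weighs the categorical mismatches against the numeric distances, is an assumed tuning parameter):

```r
# Minimal k-prototypes sketch in base R (illustration, not production code).
# num: numeric matrix; cat: character matrix; one row per data object.
kprototypes <- function(num, cat, k, lambda = 1, iter_max = 20) {
  n <- nrow(num)
  idx <- sample(n, k)                          # 1. k random centers from the data
  cen_num <- num[idx, , drop = FALSE]
  cen_cat <- cat[idx, , drop = FALSE]
  cluster <- integer(n)
  mode_of <- function(x) names(which.max(table(x)))
  for (iter in seq_len(iter_max)) {
    # 2. distance to each center: squared Euclidean for the numeric part
    #    plus lambda times the number of categorical mismatches
    d <- sapply(seq_len(k), function(j) {
      d_num <- rowSums(sweep(num, 2, cen_num[j, ])^2)
      d_cat <- rowSums(cat != matrix(cen_cat[j, ], n, ncol(cat), byrow = TRUE))
      d_num + lambda * d_cat
    })
    new_cluster <- max.col(-d)                 # nearest center per point
    if (identical(new_cluster, cluster)) break # 5. stop when assignments stabilize
    cluster <- new_cluster
    # 3./4. move each center to the mean/mode of its current members
    for (j in seq_len(k)) {
      members <- cluster == j
      if (!any(members)) next
      cen_num[j, ] <- colMeans(num[members, , drop = FALSE])
      cen_cat[j, ] <- apply(cat[members, , drop = FALSE], 2, mode_of)
    }
  }
  list(cluster = cluster, centers_num = cen_num, centers_cat = cen_cat)
}

# Toy data with two obvious mixed-type groups
set.seed(123)
num <- matrix(c(rnorm(20, 0), rnorm(20, 10)), ncol = 1)
cat <- matrix(c(rep("a", 20), rep("b", 20)), ncol = 1)
res <- kprototypes(num, cat, k = 2)
```

For real use you would scale the numeric variables first and restart from several random initializations, since (like k-means) this only converges to a local optimum.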

Depending on the level of sophistication required of your result, you may also want to look into genetic algorithms to avoid falling into local optima; I believe this is a case for which that might make particular sense.

Notes on the F-Test

This is taken from Cluster Analysis, Everitt et al., 2011. To compare if a clustering into $k+1$ clusters is better than a clustering into $k$ clusters, you calculate the F-statistic in the following manner:

$$ F_{g_1,g_2}=\frac{(S_{g_1}^2-S_{g_2}^2)/S_{g_2}^2}{\frac{n-g_1}{n-g_2}\left(\frac{g_2}{g_1}\right)^{2/p}-1} $$

where

$g_1$ = $k$, the smaller number of clusters

$g_2$ = $k+1$, the larger number of clusters

$n$ = number of data objects

$p$ = number of variables

$S_{g_1}^2$ = the sum of the sums of squared deviations from the cluster centers in the division into $k$ clusters

$S_{g_2}^2$ = the same quantity for the division into $k+1$ clusters

Your division of the $n$ objects into $g_2$ clusters is significantly better if the F-statistic exceeds the critical value of an F-distribution with $p(g_2-g_1)$ and $p(n-g_2)$ degrees of freedom.
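The test can be computed directly in base R (a sketch; `S1` and `S2` denote the total within-cluster sums of squared deviations for $k$ and $k+1$ clusters, and the function and variable names are my own):

```r
# F-test for comparing a division into g1 = k clusters against g2 = k+1
# clusters, following the formula above.
cluster_f_test <- function(S1, S2, n, p, g1, g2, alpha = 0.05) {
  f_stat <- ((S1 - S2) / S2) /
    (((n - g1) / (n - g2)) * (g2 / g1)^(2 / p) - 1)
  df1 <- p * (g2 - g1)          # numerator degrees of freedom
  df2 <- p * (n - g2)           # denominator degrees of freedom
  crit <- qf(1 - alpha, df1, df2)
  list(f_stat = f_stat, critical = crit, significant = f_stat > crit)
}

# Example: is k = 2 significantly better than k = 1 on clearly bimodal data?
set.seed(5)
x  <- matrix(c(rnorm(30, 0), rnorm(30, 12)), ncol = 1)
S1 <- kmeans(x, centers = 1)$tot.withinss
S2 <- kmeans(x, centers = 2)$tot.withinss
res <- cluster_f_test(S1, S2, n = nrow(x), p = ncol(x), g1 = 1, g2 = 2)
```

You would apply this repeatedly, increasing $k$ by one until the test no longer rejects, and use that $k$ as the stopping point for the clustering in Option 2.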