Anomaly Detection – How to Detect Anomalies with Dummy and Categorical Features

anomaly detection, categorical data, discrete data, machine learning, outliers

tl;dr

  • What is the recommended way to deal with discrete data when performing anomaly detection?
  • What is the recommended way to deal with categorical data when performing anomaly detection?
  • This answer suggests using discrete data to just filter the results.
  • Perhaps replace the category value with the percentage chance of observation?

Intro

This is my first time posting here, so if anything doesn't seem technically correct, whether in the formatting or in the use of definitions, I'm interested to know what should have been used instead.

Onwards.

I've recently been taking part in the Machine Learning class by Andrew Ng.

For anomaly detection, we've been taught to determine the Normal/Gaussian distribution parameters for each feature/variable $x_j$ within a data set, then compute the probability of each training example's/observation's value under that feature's Gaussian distribution, and finally take the product of those per-feature probabilities.

Method

Choose features $x_j$ that we think explain the activity in question:
$$\{x_1, x_2,\dots,x_n\}$$

Fit the parameters of the Gaussian for each feature:
$$\mu_j = \frac{1}{m}\sum_{i = 1}^m x_j^{(i)}$$
$$\sigma_j^2 = \frac{1}{m}\sum_{i = 1}^m (x_j^{(i)} - \mu_j)^2$$

For each training example, $x$, compute:
$$p(x) = \prod_{j = 1}^n \ p(x_j; \mu_j, \sigma_j^2)$$

We then flag as an anomaly ($y = 1$), given:
$$y = \left\{
\begin{array}{l l}
1 & \quad p(x) < \epsilon\\
0 & \quad p(x) \geq \epsilon
\end{array} \right.$$

This gives us the method with which to determine if an example requires further inspection.
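For concreteness, here is a minimal Python/NumPy sketch of the method above (my own code, not code from the course; the threshold $\epsilon$ is simply picked by hand here):

```python
import numpy as np
from scipy import stats

def fit_gaussian_params(X):
    """Per-feature MLE: mu_j and sigma_j^2 over the m training examples."""
    return X.mean(axis=0), X.var(axis=0)

def p_of_x(X, mu, var):
    """p(x) = product over the n features of univariate Gaussian densities."""
    return stats.norm.pdf(X, mu, np.sqrt(var)).prod(axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))   # m = 1000 examples, n = 3 features
mu, var = fit_gaussian_params(X_train)
epsilon = 1e-4                         # threshold, chosen by hand here
y = (p_of_x(X_train, mu, var) < epsilon).astype(int)   # y = 1 flags anomalies
```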

My Question(s)

This seems fine for continuous variables/features, but discrete data is not addressed.

What about dummy variables, e.g. a gender flag feature, possibly called [IsMale], that can take the value $0$ or $1$? To take a dummy feature into account, would we use the binomial distribution instead when calculating $p(x)$?
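To illustrate what I have in mind: a Bernoulli factor (i.e. a binomial with a single trial) could be estimated from the training frequency and slotted into the same product:

$$\phi_j = \frac{1}{m}\sum_{i=1}^m x_j^{(i)}, \qquad p(x_j; \phi_j) = \phi_j^{x_j}(1 - \phi_j)^{1 - x_j}$$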

What about categorical data such as car colour? We could map colours to numerical values, e.g. $red \to 1, blue \to 2$, but the distribution of such a feature could be close to uniform (i.e. each colour equally likely), and since the mapping is arbitrary rather than ordinal ($red$ having the value $1$ carries no order information), does it make sense to try to transform a non-normal distribution of colour frequencies into a normal one (and does it even matter that the feature is not ordinal)? For example, a $\log()$ transform wouldn't make sense to me, as the data is neither continuous nor ordinal. So perhaps it would be best to find a discrete distribution that fits the feature, as opposed to "torturing" the data to fit a Gaussian?
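Concretely, the simplest discrete fit I can think of would use the observed category frequencies as that feature's factor in $p(x)$:

$$p(x_j = k) = \frac{\#\{\,i : x_j^{(i)} = k\,\}}{m}$$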

Questions: (updated: 2015-11-24)

  • Can binary variables be modeled with a binomial probability distribution and become another factor in the $p(x)$ calculation?
  • Should categorical variables be modeled with a discrete probability distribution instead of a Gaussian, and become another factor in the $p(x)$ calculation?
  • Is there another method altogether that takes into account what I'm asking here that I can further research/learn about?
  • What is the recommended way to deal with discrete data when performing anomaly detection?
  • What is the recommended way to deal with categorical data when performing anomaly detection?

Edit: 2017-05-03

  • This answer suggests using discrete data to just filter the results.
  • Perhaps replace the category value with the percentage chance of observation?

Best Answer

In general, discrete* and categorical features aren't particularly amenable to this method of outlier analysis. Since there is no magnitude associated with categorical predictors, we are working with:

  • Frequency of the category being observed in the global data
  • Frequency of the category being observed within subspaces of the data

Note that neither of these qualities can be analyzed in isolation, as your Gaussian method requires. Instead, we need a method that contextualizes categorical features & considers the correlational nature of the data.

Here are some techniques for categorical & mixed attribute data, based on Outlier Analysis by Aggarwal:

  • If you can define a similarity function which builds a positive semidefinite matrix across all observations (regardless of data types), compute the similarity matrix $S$, find its diagonalization $S=Q_k\lambda_k^2Q_k^T$, and use the non-zero eigenvectors $Q_k$ to compute a feature embedding $E = Q_k\lambda_k$. For each row (observation) in $E$, compute its distance from the centroid; this is your outlier score, and you can use univariate methods to determine outliers. (See the first sketch after this list.)
  • If you have purely categorical features, fit a mixture model to the raw categorical data. Anomalous points have the lowest generative probability. (Second sketch below.)
  • Use one-hot encoding for categorical predictors and, optionally, latent variable analysis** for ordinal variables with non-apparent continuous mappings:
    • Standardize the non-one-hot features (one-hot features are already implicitly standardized) and perform Principal Component Analysis. Perform dimensionality reduction using the top principal components (or a soft PCA approach where eigenvectors are weighted by eigenvalues) and run a typical continuous outlier analysis method (e.g. a mixture model or your Gaussian method). (Third sketch below.)
    • Perform an angle-based analysis. For each observation, compute cosine similarities between all pairs of points. Observations with the smallest variance of these similarities (known as the "Angle-Based Outlier Factor") are most likely outliers. This may require a final analysis of the empirical distribution of ABOF to determine what is anomalous. (Fourth sketch below.)
    • If you have labelled outliers: Fit a predictive model to the engineered data (logistic regression, SVM, etc.).
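Below are rough NumPy sketches of these techniques under stated assumptions; the function names and tolerances are mine, not Aggarwal's. First, the spectral embedding: `np.linalg.eigh` returns $S = Q\,\mathrm{diag}(w)\,Q^T$, so $\lambda_k = \sqrt{w_k}$.

```python
import numpy as np

def embedding_outlier_scores(S, tol=1e-10):
    """Embed observations from a positive semidefinite similarity matrix S
    (m x m), then score each row by its distance from the embedding centroid."""
    w, Q = np.linalg.eigh(S)            # S = Q diag(w) Q^T, w ascending
    keep = w > tol                      # drop (numerically) zero eigenvalues
    E = Q[:, keep] * np.sqrt(w[keep])   # E = Q_k lambda_k, so E E^T ~ S
    return np.linalg.norm(E - E.mean(axis=0), axis=1)   # large = outlying
```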
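Second, the mixture model for purely categorical data: an EM sketch for a mixture of independent categoricals (a latent class model). The Laplace smoothing constant and the random initialization are my own choices; in practice a library implementation with multiple restarts would be safer.

```python
import numpy as np

def fit_categorical_mixture(X, k, n_iter=100, alpha=1e-2, seed=0):
    """EM for a mixture of independent categorical distributions.
    X: (m, n) integer-coded categories; k: number of mixture components."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    n_cats = [int(X[:, j].max()) + 1 for j in range(n)]
    pi = np.full(k, 1.0 / k)                        # mixture weights
    # theta[j][c, v] = P(feature j = value v | component c), random init
    theta = [rng.dirichlet(np.ones(nc), size=k) for nc in n_cats]
    for _ in range(n_iter):
        # E-step: responsibilities r[i, c] under the current parameters
        log_r = np.tile(np.log(pi), (m, 1))
        for j in range(n):
            log_r += np.log(theta[j][:, X[:, j]]).T         # shape (m, k)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and (smoothed) category probabilities
        pi = r.mean(axis=0)
        for j in range(n):
            counts = np.full((k, n_cats[j]), alpha)
            np.add.at(counts.T, X[:, j], r)         # counts[c, v] += r[i, c]
            theta[j] = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta

def generative_log_prob(X, pi, theta):
    """log p(x) per row; the lowest values are the anomaly candidates."""
    lp = np.tile(np.log(pi), (X.shape[0], 1))
    for j, th in enumerate(theta):
        lp += np.log(th[:, X[:, j]]).T
    mx = lp.max(axis=1, keepdims=True)
    return mx.ravel() + np.log(np.exp(lp - mx).sum(axis=1))
```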
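Third, the one-hot + PCA route, in plain NumPy/SciPy (scikit-learn's StandardScaler and PCA would serve equally well):

```python
import numpy as np
from scipy import stats

def pca_gaussian_scores(X_cont, X_onehot, n_components):
    """Standardize continuous features, append one-hot columns, project onto
    the top principal components, then apply the per-feature Gaussian method."""
    Z = (X_cont - X_cont.mean(axis=0)) / X_cont.std(axis=0)
    X = np.hstack([Z, X_onehot])
    Xc = X - X.mean(axis=0)                    # centering for PCA
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:n_components].T               # scores on top components
    mu, sigma = T.mean(axis=0), T.std(axis=0)
    return stats.norm.pdf(T, mu, sigma).prod(axis=1)   # low p(x) => anomaly
```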
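Finally, the angle-based analysis. This follows the simplified, plain-cosine description in the bullet above (the original ABOF additionally weights each term by pairwise distances); it is $O(m^3)$, so it's illustrative rather than practical for large data:

```python
import numpy as np

def abof_scores(X):
    """For each point, the variance of cosine similarities between difference
    vectors to all other pairs of points; small variance suggests the point
    sees the rest of the data within a narrow angular range, i.e. an outlier."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    scores = np.empty(m)
    for i in range(m):
        D = np.delete(X, i, axis=0) - X[i]              # difference vectors
        D /= np.linalg.norm(D, axis=1, keepdims=True)
        C = D @ D.T                                     # pairwise cosines
        iu = np.triu_indices(m - 1, k=1)                # each pair once
        scores[i] = C[iu].var()
    return scores   # smallest scores are the most likely outliers
```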

*Discrete features could possibly be handled approximately in your Gaussian method. Under the right conditions, a feature may be well approximated by a normal distribution (e.g. a binomial random variable with $npq > 3$ is roughly $\mathcal{N}(np,\, npq)$). If not, handle them as ordinals, as described above.

**This is similar to your idea of "replace the category value with the percentage chance of observation".