Solved – Anomaly Detection over multivariate categorical and numerical predictors

anomaly detection, categorical data, isolation-forest, pca, python

I am trying to implement Anomaly Detection over a multivariate dataset having categorical and numerical predictors.

In the sample records below, product_type, company_type and currency are categorical variables (nominal, to be precise), whereas price is a numerical variable.

My model is able to identify the price anomaly for product_id=10, because prices for the other products lie between 10 and 500 EUR for a given combination of (product-type, company-type, price, currency).

But it is not able to identify anomalies for product_id=5 or product_id=8 as they have unusual currency or product type.

Dataset example

I have tried different approaches, such as Multiple Correspondence Analysis (MCA) for categorical encoding and dimensionality reduction, combined with One-Class SVM and Isolation Forest. I have even tried deep learning approaches using autoencoders. But none of the models is able to identify anomalies in the categorical predictor variables.

I have also referred to other answers, such as:

Anomaly Detection with Dummy Features (and other Discrete/Categorical Features)

and

Outlier detection with data (which has categorical and numeric variables) with R

but could not resolve my problem.

I have recently started my data science journey and would really appreciate any help.

Best Answer

It depends on what you call an outlier. If, for you, an outlier in categorical data is a category that appears less than, say, 1% of the time, then there is a very simple way to detect it: count the occurrences of each category (for example with pandas value_counts) and apply a threshold to find which categories are abnormal in that sense. One-class SVM and isolation forest do not detect this type of outlier because a rare category does not show up as an abnormally large value. In short, these algorithms will treat your dataset as unbalanced, but in most cases they will not flag YEN as abnormal.
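A minimal sketch of that counting approach, in case it helps. The data below is invented for illustration (only the column names come from the question), and the helper `rare_category_mask` and the 15% threshold are my own assumptions; on a real dataset you would pick something closer to the 1% mentioned above.

```python
import pandas as pd

# Invented sample data: only the column names are taken from the question.
df = pd.DataFrame({
    "product_id": list(range(1, 11)),
    "currency": ["EUR"] * 4 + ["YEN"] + ["EUR"] * 5,
    "product_type": ["book"] * 7 + ["car"] + ["book"] * 2,
})

def rare_category_mask(series, min_share):
    """Mark rows whose category makes up less than min_share of the column."""
    # value_counts(normalize=True) gives the relative frequency of each category.
    shares = series.value_counts(normalize=True)
    # Map every row to the frequency of its own category, then apply the threshold.
    return series.map(shares) < min_share

# Assumed threshold; it is high only because this toy dataset has 10 rows.
MIN_SHARE = 0.15
categorical_cols = ["currency", "product_type"]

# A row is flagged if any of its categorical values is rare.
flags = pd.DataFrame(
    {col: rare_category_mask(df[col], MIN_SHARE) for col in categorical_cols}
)
df["is_rare"] = flags.any(axis=1)

print(df[df["is_rare"]])
```

This only looks at each column on its own, so it will not catch a combination of categories that is unusual while each value is individually common. The numerical price column would still need a separate check.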