Time Series – Detecting Unusual Trends and Anomalies Using Mixed Data (Categorical and Numerical)

data mining, mixed type data, outliers, test-for-trend, time series

I've been asked to detect "unusual trends and anomalies" using data similar to ATM transaction data. Each entry has a mixture of numerical and categorical variables, things like transaction ID, timestamp, transaction type, transaction amount, etc. There are about 10 categorical and 10 numerical variables. The goal of the project would be to write a script that gives an alert in real time when unusual trends/anomalies are detected in newly logged data.

The exact definitions of "unusual trend" and "anomaly" haven't been given to me, and there are no labels telling me which rows of the dataset are "usual".

  1. To detect anomalies (outliers) I would like to use a distance-based measure. I'm not used to calculating these with categorical data, but I believe I could use something like Gower similarity. I could also transform the categorical data into binary vectors, e.g. if there are only two transaction types, "withdrawal" = [1 0]. Would it be appropriate to look for outliers using these secondary, all-numerical variables? (I've sketched what I mean in code after this list.)

  2. I'm less sure how to detect unusual trends. Outlier detection seems inappropriate, since an "unusual trend" might not contain any data points that are outliers on their own. If it were a time series, I'd want to use something like a seasonal ARIMA, an autocorrelation function, or something similar. How appropriate is it to convert data like mine (irregular time steps, categorical + numerical) into a time series? If that's not a good approach, what kinds of models are appropriate for detecting trends in this kind of data?
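For concreteness, here is a rough sketch of the one-hot-encoding route from point 1, using scikit-learn's nearest-neighbour distances as an outlier score. The file name, column names, k, and the threshold are all placeholders, not my real data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("transactions.csv")           # placeholder file name
cat_cols = ["transaction_type", "machine_id"]  # ~10 categorical columns in reality
num_cols = ["amount", "duration_sec"]          # ~10 numerical columns in reality

# Scale numerics and one-hot encode categoricals so everything is numeric and
# roughly comparable in magnitude (a crude stand-in for a proper Gower distance).
prep = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
X = prep.fit_transform(df)

# Outlier score = distance to the k-th nearest neighbour; large = unusual.
k = 10
nn = NearestNeighbors(n_neighbors=k).fit(X)
dist, _ = nn.kneighbors(X)
score = dist[:, -1]
df["outlier"] = score > score.mean() + 3 * score.std()  # naive threshold
print(df[df["outlier"]].head())
```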

Thanks a lot. Any help or insight is hugely appreciated!

Best Answer

What you want to do (and what I have done) is to:

  1. Take these (what I assume are) irregularly timed readings and "bucket" them into regular periods, e.g. hourly or daily.

  2. For each transaction type and for each kind of machine, create a time series (e.g. counts or total amounts per period), as sketched below.
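A minimal sketch of steps 1 and 2 in pandas, assuming the raw log has a timestamp, a machine identifier, a transaction type, and an amount (all of these names are hypothetical):

```python
import pandas as pd

raw = pd.read_csv("transactions.csv", parse_dates=["timestamp"])

# One regularly spaced series per (machine, transaction type): hourly
# transaction counts and hourly totals, with empty buckets showing up as zeros.
hourly = (
    raw.set_index("timestamp")
       .groupby(["machine_id", "transaction_type"])["amount"]
       .resample("1H")
       .agg(["count", "sum"])
)

# e.g. the hourly withdrawal count for one machine:
series = hourly.loc[("ATM_001", "withdrawal"), "count"]
```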

Each such sequence of values can then be analyzed for daily, weekly, and monthly patterns and for memory (autoregressive) effects using a Transfer Function model.

This analysis can detect level shifts and local trends using Intervention Detection schemes. There may also be one-time effects (pulse anomalies) that need to be identified and neutralized so that they do not distort the model parameters.
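Formal Intervention Detection (pulses, level shifts, local trends) is more involved than a short post allows, but a crude residual-based stand-in, assuming the hourly count series from the sketch above and statsmodels, looks like this:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit a simple seasonal model to the hourly series (the orders here are
# arbitrary placeholders; they should be chosen by inspecting the data).
model = SARIMAX(series, order=(1, 0, 1), seasonal_order=(1, 0, 1, 24))
fit = model.fit(disp=False)

# Flag hours whose standardized residuals are far from what the model expects.
z = (fit.resid - fit.resid.mean()) / fit.resid.std()
anomalies = z[np.abs(z) > 4].index
print(anomalies)
```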

There may be changes over time in the daily patterns, in the error variance, or in other model coefficients.

There may be (read: will be!) holiday effects around known holidays. Particular days of the month may have an effect, particular weeks of the month may have an effect, and there may be weekend effects. Lots of things to explore and find out.
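Those calendar effects can be carried as exogenous regressors (the `exog` argument of SARIMAX, for example). A sketch of building such dummy variables for the hourly index above; the holiday dates are placeholders:

```python
import pandas as pd

idx = series.index
exog = pd.DataFrame(index=idx)
exog["weekend"] = (idx.dayofweek >= 5).astype(int)
exog["month_start"] = (idx.day <= 3).astype(int)        # early-in-the-month effect
exog["week_of_month"] = (idx.day - 1) // 7               # 0..4

holidays = pd.to_datetime(["2023-12-25", "2024-01-01"])  # placeholder dates
exog["holiday"] = idx.normalize().isin(holidays).astype(int)

# Pass `exog` to the model, e.g. SARIMAX(series, exog=exog, ...), and test
# which of these effects actually matter.
```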

With that said, if you wish to post some daily data covering a 3-4 year period I might be able to help further. If you wanted to move down to hourly forecasts, that could also be done at a later stage. A useful model not only characterizes/describes the historical data but can also provide early warning of the onset of change.

Unusual values do not have to be specified up front; they arise from detecting values that are inconsistent with the past.
