[Math] How to detect anomalies in real-time data

algorithms, statistics

I'm trying to find a few good algorithms to solve the problem of finding anomalies in my data.
One of the main constraints is that I need to detect them in real-time data.

So I came up with the idea of calculating the average as new data comes in, and also calculating the standard deviation. I then add the standard deviation to the average to get the upper limit, and subtract the standard deviation from the average to get the bottom limit. Finally, I check whether the incoming value is outside that range.

Here's a quick sketch:

             /\                  /\/\
            /  \                /    \           ** IF VALUE IS ABOVE THEN ALERT
 ----/\----/----\/\----/\------/------\/\----- > upper limit
 ---/--\--/--------\--/--\----/----------\--/- > average
 --/----\/----------\/----\--/------------\/-- > bottom limit
  /                        \/                    ** IF VALUE IS BELOW THEN ALERT

I used the approach from here.
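
In code, the running check I'm describing looks roughly like this (a minimal sketch using Welford's online update; the multiplier and the sample values are made up purely for illustration):

    # Minimal sketch of the running mean +/- k*SD check described above.
    # Uses Welford's online update so the mean and variance are maintained
    # incrementally as each value arrives. The multiplier k and the sample
    # stream are only for illustration.
    import math

    class RunningAnomalyDetector:
        def __init__(self, k=1.0):
            self.k = k          # number of standard deviations for the band
            self.n = 0          # count of values seen so far
            self.mean = 0.0     # running mean
            self.m2 = 0.0       # running sum of squared deviations

        def update(self, x):
            """Add a new value and report whether it falls outside mean +/- k*SD."""
            # Check against the band built from the values seen so far.
            is_anomaly = False
            if self.n >= 2:
                sd = math.sqrt(self.m2 / (self.n - 1))
                is_anomaly = abs(x - self.mean) > self.k * sd
            # Welford's incremental update of the mean and variance.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
            return is_anomaly

    detector = RunningAnomalyDetector(k=1.0)
    for value in [5, 6, 5, 7, 6, 5, 30, 6, 5]:   # made-up stream; 30 is an obvious outlier
        if detector.update(value):
            print("ALERT:", value)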

The problems I am facing are:

  • if I have big spikes every weekend, let's say, I wouldn't want them flagged as anomalies, because they happen every weekend.
    But if I get something odd, even if it's not as big as the weekend spikes, I'd like it flagged as an anomaly

  • also I'd like to have some other algorithm which will look at the volume on Monday this week and the volume on Monday the previous week and compare those values

  • going a bit more into detail on the above, I'm also trying to figure out how big the time block is supposed to be:

    • let's say I check the volume daily; I could get a volume of 46 on the first day and 50 on the next, which wouldn't be odd. But if I looked at the second hour of the first day the count might be 3, and in the second hour of the second day it might be 25, and that is an anomaly I would like to know about
    • how do I determine the appropriate frequency of measurement of events based on the volume and variance of the data?

These are the problems I would like to solve. I'm not asking for exact answers, but for pointers to the algorithms best suited to each of these issues.

Best Answer

There are a number of techniques that can be applied to detect anomalies in time series. The "running variance" approach illustrated in the link of the OP is not a valid method: the use of mean $\pm k$ SD is too simplistic, does not take into account normal irregular behaviour, implies a large loss of information, does not consider noise, and so on.

The methods used in this context fall into three main groups:

  • unsupervised, where there is no need to generate any reference data set and the time series $x_i$ is analyzed directly (under the assumption that the majority of data are normal) by looking for instances that "fit" the rest of the data set least well;
  • supervised, which require a training data set containing labeled normal and abnormal instances;
  • semi-supervised, where a reference data set $y_i$ constructed from a normal training time series is taken as a model for normal behaviour and used to assign an anomaly score to instances of the observed data set $x_i$.

This last group includes a number of different subgroups, such as kernel-based, window-based, predictive, and box-based (also called segmentation-based) methods. In all techniques, instances can be single data points, sequences, subsequences, and so on.

Based on the information given in the OP, an unsupervised approach is probably the best balance between performance and simplicity in this case. A widely used technique in this group is the k-nearest-neighbor algorithm, a distance-based method built on the assumption that normal instances occur in "dense" neighborhoods, whereas abnormal instances lie far from their closest neighbors. Briefly, each instance is given an anomaly score, defined as the distance between that instance and its $k^{th}$ nearest neighbor in the data set (with $k$ usually a small integer, often set to $1$). A threshold is then used to decide whether the instance is anomalous. Several types of distance can be used, but for continuous variables the Euclidean distance, $d(\mathbf{x}_i,\mathbf{x}_j)=\lVert \mathbf{x}_i-\mathbf{x}_j\rVert_2$ (in 2D, $\sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$), is most commonly adopted.
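
As a rough illustration only (not a prescription), a minimal sketch of this $k^{th}$-nearest-neighbor scoring in Python could look as follows; the toy data, the value of $k$, and the threshold are arbitrary choices made for the example:

    # Hedged sketch of the kth-nearest-neighbor anomaly score described above.
    # Each instance is scored by the Euclidean distance to its kth nearest
    # neighbor; instances whose score exceeds a chosen threshold are flagged.
    import numpy as np

    def knn_anomaly_scores(points, k=1):
        """Return, for each point, the distance to its kth nearest neighbor."""
        points = np.asarray(points, dtype=float)
        # Pairwise Euclidean distances (fine for small data sets; O(n^2) memory).
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        np.fill_diagonal(dists, np.inf)   # ignore the distance of a point to itself
        dists.sort(axis=1)                # ascending distances per row
        return dists[:, k - 1]            # distance to the kth nearest neighbor

    # Toy 2D instances (hour index, volume): most points form a dense cluster.
    data = [(1, 3), (2, 4), (3, 3), (4, 5), (5, 4), (6, 25)]
    scores = knn_anomaly_scores(data, k=1)
    threshold = 5.0                       # illustrative cut-off
    for point, score in zip(data, scores):
        if score > threshold:
            print("anomalous instance:", point, "score:", round(score, 2))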

Lastly, consider that there are a number of variants of this method proposed by different authors. For example, one can define as anomalies simply the $m$ largest distances observed in the data set, thus avoiding the need to define a threshold. Alternatively, the anomaly score of a given instance can be calculated by summing the distances from its $k$ nearest neighbors, or by counting the number $j$ of neighbors that are no more than a distance $d$ from that instance. This last variant, which essentially calculates the density of data in an $n$-dimensional ball of radius $d$ around each instance, is commonly used because of its simplicity: in one dimension the density is $j/d$, in 2D (as would probably be the situation of the OP) it is $j/(\pi d^2)$, and so on. The resulting anomaly score is given by the inverse of the density. As a further simplification, some authors have suggested using, as the anomaly score, the ratio $1/j$ once $d$ is predefined, or $d$ (alternatively $1/d$) once $j$ is predefined.
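
A minimal sketch of the density-based variant, again with an arbitrary toy data set and radius $d$ (the choice of $d$ is left to the user, not fixed by the method):

    # Hedged sketch of the density variant: count the j neighbors lying within
    # a fixed radius d of each instance, compute the 2D density j / (pi * d^2),
    # and use its inverse as the anomaly score.
    import math
    import numpy as np

    def inverse_density_scores(points, d=3.0):
        """Anomaly score = 1 / density, with density = j / (pi * d^2) in 2D."""
        points = np.asarray(points, dtype=float)
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        np.fill_diagonal(dists, np.inf)       # do not count the point itself
        j = (dists <= d).sum(axis=1)          # neighbors within radius d
        area = math.pi * d ** 2
        # Points with no neighbor inside the radius get an infinite score.
        with np.errstate(divide="ignore"):
            return np.where(j > 0, area / j, np.inf)

    data = [(1, 3), (2, 4), (3, 3), (4, 5), (5, 4), (6, 25)]
    print(inverse_density_scores(data, d=3.0))   # the isolated point gets score inf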

The unsupervised k-nearest-neighbor approach can probably solve most of the issues reported in the OP. With an appropriate temporal representation (I would suggest plotting the data with a cyclic representation, initially using an x-axis range of one week), "normal" big spikes (e.g. at the weekend) are well recognized and not flagged as anomalies, whereas even smaller abnormal spikes are promptly identified. The method also makes it easy to identify anomalies relative to a particular day of the week, e.g. a spike occurring on a Monday that is abnormal with respect to a "normal" Monday, or to compare the same day across different weeks. Lastly, the choice of the appropriate measurement frequency depends on the temporal resolution we want to achieve in the analysis. Although some quantitative analyses could explore this issue as well, in most cases the measurement frequency is chosen empirically, according to the desired temporal resolution and after visual inspection of the data.
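
To make the cyclic weekly representation concrete, one possible encoding (an assumption on my part, not something fixed by the method) is to score each observation only against observations falling in the same hour-of-week slot, so that a regular weekend spike is compared with previous weekends rather than with quiet weekdays; the slot indices and volumes below are hypothetical:

    # Sketch: score each observation against other observations from the same
    # hour-of-week slot, so regular weekly spikes look normal while a volume
    # that is unusual for that slot stands out. Data are hypothetical.
    import numpy as np

    def weekly_knn_scores(hours_of_week, volumes, k=1):
        """Distance to the kth nearest neighbor among observations sharing the
        same hour-of-week slot; large values suggest an unusual volume."""
        hours_of_week = np.asarray(hours_of_week)
        volumes = np.asarray(volumes, dtype=float)
        scores = np.full(len(volumes), np.nan)
        for slot in np.unique(hours_of_week):
            idx = np.where(hours_of_week == slot)[0]
            if len(idx) <= k:
                continue                              # not enough history yet
            vals = volumes[idx]
            dists = np.abs(vals[:, None] - vals[None, :])   # 1D distances
            np.fill_diagonal(dists, np.inf)
            dists.sort(axis=1)
            scores[idx] = dists[:, k - 1]
        return scores

    # Hypothetical volumes for the "Monday, 2nd hour" slot over four weeks.
    hours = [26, 26, 26, 26]     # same hour-of-week index each week
    volumes = [3, 4, 3, 25]      # the last Monday is unusual for a Monday
    print(weekly_knn_scores(hours, volumes, k=1))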
