Solved – How to prepare/construct features for anomaly detection (network security data)

feature-selection feature-engineering outliers unsupervised-learning

My goal is to analyse network logs (e.g., Apache, syslog, Active Directory security audit logs, and so on) using clustering / anomaly detection for intrusion detection purposes.

From the logs I have a lot of text fields, like IP address, username, hostname, destination port, source port, and so on (15-20 fields in total). I do not know whether there are any attacks in the logs, and I want to highlight the most suspicious events (outliers).

Usually, anomaly detection marks points with low probability/frequency as anomalies. However, half of the log records contain a unique combination of fields, so half of the records in the dataset will have the lowest possible frequency.

If I use anomaly detection based on clustering (e.g., find clusters and then select points that are far from all cluster centers), I need to find the distance between different points. Since I have 15-20 fields, this will be a multi-dimensional space, where the dimensions are username, port, IP address, and so on. However, Mahalanobis distance could be only applied to normally distributed features. This means that there is no way to find the distance between data points and construct clusters…
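
To make the cluster-center idea concrete, here is a minimal sketch of that scoring scheme, using synthetic numeric data (the hard part, turning the log fields into such numbers, is exactly the question being asked):

```python
# Sketch of clustering-based outlier scoring: fit k-means, then score each
# point by its distance to the nearest cluster center. The feature values
# here are synthetic, purely for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two dense clusters plus one far-away point (the planted "anomaly").
X = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
    [[20.0, 20.0]],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# km.transform(X) gives the distance of every point to every center;
# take the minimum to get the distance to the nearest center.
dist_to_center = np.min(km.transform(X), axis=1)

# The most suspicious point is the one farthest from all centers.
most_suspicious = int(np.argmax(dist_to_center))
```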

For example, let's imagine that I have users Alice, Bob, Carol, Dave, Eve, and Frank in a dataset of 20 records. They could have the following numbers of occurrences in the database: 2, 5, 2, 5, 1, 5. If I simply map usernames to numbers, e.g.

Alice --> 1
Bob --> 2
Carol --> 3
Dave --> 4
Eve --> 5
Frank --> 6

Then, my probability distribution for usernames will look as follows:

p(1) = 0.1,
p(2) = 0.25,
p(3) = 0.1,
p(4) = 0.25,
p(5) = 0.05,
p(6) = 0.25

Of course, this is not a normal distribution, and it also does not make much sense, since I could have mapped the usernames in any other way…
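
For concreteness, the probabilities above are simply the per-user occurrence counts divided by the 20 records:

```python
# Checking the probabilities above: per-user counts over 20 records.
counts = {"Alice": 2, "Bob": 5, "Carol": 2, "Dave": 5, "Eve": 1, "Frank": 5}
total = sum(counts.values())                      # 20 records
probs = {user: c / total for user, c in counts.items()}
# probs["Alice"] == 0.1, probs["Bob"] == 0.25, probs["Eve"] == 0.05, ...
```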

Thus, simply mapping fields like username, action, port number, IP address, and so on to numbers does not accomplish anything.
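
Two common ways around the arbitrary-ordering problem are one-hot encoding (each category becomes its own binary column, so no mapping choice can distort distances) and frequency encoding (each value is replaced by how often it occurs, which captures the "rare value" signal directly). A small sketch with made-up usernames:

```python
# One-hot and frequency encoding of a categorical field, avoiding the
# arbitrary ordering that integer codes impose. Data is illustrative.
import pandas as pd

records = pd.DataFrame({"username": ["Alice", "Bob", "Bob", "Eve"]})

# One-hot: one binary column per username.
onehot = pd.get_dummies(records["username"])

# Frequency encoding: replace each value by its relative frequency,
# so rare usernames get small values.
freq = records["username"].map(
    records["username"].value_counts(normalize=True)
)
# freq is [0.25, 0.5, 0.5, 0.25]
```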

Therefore, I would like to ask: how are text fields usually processed / features usually constructed to make unsupervised anomaly/outlier detection possible?

EDIT: data structure.

I have about 100 columns in the database table, containing information from Active Directory events. From these 100 columns I select the most important ones (from my point of view): SubjectUser, TargetUser, SourceIPaddress, SourceHostName, SourcePort, Computer, DestinationIPaddress, DestinationHostName, DestinationPort, Action, Status, FilePath, EventID, WeekDay, DayTime.

Events are Active Directory events, where EventID defines what was logged (e.g., creation of Kerberos ticket, user logon, user logoff, etc.).

A data sample looks like the following:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ID         |SubjectUser|TargetUser|SourceIPaddress|SourceHostName              |SourcePort|Computer                    |DestinationIPaddress|DestinationHostName         |DestinationPort|Action                |Status  |FilePath|EventID|WeekDay|DayTime|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|171390673  |?          |?         |?              |?                           |?         |domaincontroller1.domain.com|1.1.1.1             |domaincontroller1.domain.com|?              |/Authentication/Verify|/Success|?       |4624   |1      |61293  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|173348232  |?          |?         |?              |?                           |?         |domaincontroller2.domain.com|2.2.2.2             |domaincontroller2.domain.com|?              |/Authentication/Verify|/Success|?       |4624   |1      |61293  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|180176916  |?          |?         |?              |?                           |?         |domaincontroller2.domain.com|2.2.2.2             |domaincontroller2.domain.com|?              |/Authentication/Verify|/Success|?       |4624   |1      |61293  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|144144725  |?          |John.Doe  |3.3.3.3        |domaincontroller3.domain.com|2407      |domaincontroller3.domain.com|3.3.3.4             |domaincontroller3.domain.com|?              |/Authentication/Verify|/Success|?       |4624   |3      |12345  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Altogether, I have about 150 million events. Different events have different fields filled in, and not all events are related to user logon/logoff.
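
One possible end-to-end sketch for fields like these: one-hot encode the categorical columns, pass the numeric ones through, and rank events with an Isolation Forest (a tree-based detector that needs no distance metric or distributional assumption). The column names come from the table above; the rows and the choice of detector are illustrative assumptions, not the asker's setup:

```python
# Hypothetical pipeline: encode a few AD event fields, then rank events
# by anomaly score with an Isolation Forest. Data rows are made up.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

events = pd.DataFrame({
    "Computer": ["dc1", "dc2", "dc2", "dc3"],
    "Action":   ["/Authentication/Verify"] * 4,
    "EventID":  [4624, 4624, 4624, 4624],
    "WeekDay":  [1, 1, 1, 3],
    "DayTime":  [61293, 61293, 61293, 12345],
})

encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Computer", "Action"])],
    remainder="passthrough",   # numeric columns pass through unchanged
)
model = Pipeline([
    ("encode", encode),
    ("forest", IsolationForest(random_state=0)),
])
model.fit(events)

# Lower score = more anomalous; use the scores to rank events for review.
scores = model.score_samples(events)
```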

Best Answer

I'm definitely not an expert on anomaly detection. However, it's an interesting area, and here are my two cents. First, regarding your note that "Mahalanobis distance could be only applied to normally distributed features": I ran across some research arguing that it is still possible to use that metric on non-normal data. Take a look for yourself at this paper and this technical report.
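
It may also help to see that computing the distance itself requires only a mean vector and an invertible covariance matrix; normality affects the interpretation of the distances, not the computation. A small sketch on deliberately non-normal (exponential) data:

```python
# Mahalanobis distance of each point from the sample mean, on non-normal
# data. The exponential sample is synthetic, purely for illustration.
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=(200, 3))   # skewed, non-normal

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d = np.array([mahalanobis(x, mu, cov_inv) for x in X])
# Points with the largest d are the candidates to flag for review.
```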

I also hope that you'll find the following resources on unsupervised anomaly detection (AD) in the IT network security context useful, covering various approaches and methods: this paper, presenting a geometric framework for unsupervised AD; this paper, which uses a density-based and grid-based clustering approach; and these presentation slides, which mention the use of self-organizing maps for AD.
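
The density-based idea in particular maps naturally onto this problem: DBSCAN labels points in low-density regions as noise (-1), which can serve directly as a "suspicious event" flag. A minimal sketch with synthetic data:

```python
# DBSCAN-based outlier flagging: points not reachable from any dense
# region get label -1 ("noise"). Data and parameters are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, size=(60, 2)),   # one dense region
    [[8.0, 8.0]],                        # one isolated point
])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
outliers = np.where(labels == -1)[0]    # indices flagged as noise
```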

Finally, I suggest you take a look at the following answers of mine, which I believe are relevant to the topic and thus might be helpful: an answer on clustering approaches, an answer on non-distance-based clustering, and an answer on software options for AD.