Solved – Applying machine learning for DDoS filtering

classification, neural networks, unsupervised learning

In Stanford's Machine Learning course, Andrew Ng mentioned applying ML in IT.
Some time later, when our site got hit by a moderate-size DDoS (about 20k bots), I decided to fight it with a simple neural network classifier.

I wrote this Python script in about 30 minutes:
https://github.com/SaveTheRbtz/junk/tree/master/neural_networks_vs_ddos

It uses pyBrain and takes three nginx logs as input. Two of them are used to train the neural network:

  1. With good queries
  2. With bad ones

The third log is the one to classify.

From bad queries…

0.0.0.0 - - [20/Dec/2011:20:00:08 +0400] "POST /forum/index.php HTTP/1.1" 503 107 "http://www.mozilla-europe.org/" "-"

…and good…

0.0.0.0 - - [20/Dec/2011:15:00:03 +0400] "GET /forum/rss.php?topic=347425 HTTP/1.0" 200 1685 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0"

…it constructs a dictionary:

['__UA___OS_U', '__UA_EMPTY', '__REQ___METHOD_POST', '__REQ___HTTP_VER_HTTP/1.0', 
'__REQ___URL___NETLOC_', '__REQ___URL___PATH_/forum/rss.php', '__REQ___URL___PATH_/forum/index.php',
'__REQ___URL___SCHEME_', '__REQ___HTTP_VER_HTTP/1.1', '__UA___VER_Firefox/3.0',
'__REFER___NETLOC_www.mozilla-europe.org', '__UA___OS_Windows', '__UA___BASE_Mozilla/5.0',
'__CODE_503', '__UA___OS_pl', '__REFER___PATH_/', '__REFER___SCHEME_http', '__NO_REFER__',
'__REQ___METHOD_GET', '__UA___OS_Windows NT 5.1', '__UA___OS_rv:1.9',
'__REQ___URL___QS_topic', '__UA___VER_Gecko/2008052906']
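
For illustration, here is a minimal sketch of how such tokens might be pulled from a parsed log entry. The helper name and the exact parsing details are my assumptions, not necessarily what the script on GitHub does:

    # Sketch: derive feature tokens (like the dictionary entries above) from one
    # already-parsed nginx log entry.
    from urllib.parse import urlparse, parse_qs   # Python 3; the original script is Python 2 era

    def extract_features(method, url, http_ver, code, referer, user_agent):
        tokens = set()
        tokens.add('__REQ___METHOD_' + method)
        tokens.add('__REQ___HTTP_VER_' + http_ver)
        tokens.add('__CODE_' + str(code))

        req = urlparse(url)
        tokens.add('__REQ___URL___SCHEME_' + req.scheme)
        tokens.add('__REQ___URL___NETLOC_' + req.netloc)
        tokens.add('__REQ___URL___PATH_' + req.path)
        for qs_key in parse_qs(req.query):
            tokens.add('__REQ___URL___QS_' + qs_key)

        if referer == '-':
            tokens.add('__NO_REFER__')
        else:
            ref = urlparse(referer)
            tokens.add('__REFER___SCHEME_' + ref.scheme)
            tokens.add('__REFER___NETLOC_' + ref.netloc)
            tokens.add('__REFER___PATH_' + ref.path)

        if user_agent == '-':
            tokens.add('__UA_EMPTY')
        else:
            # Splitting the UA string into __UA___BASE_/__UA___OS_/__UA___VER_
            # tokens is omitted here for brevity.
            tokens.add('__UA___BASE_' + user_agent.split(' ', 1)[0])

        return tokens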

Each entry that we train our network on, or that we need to classify…

0.0.0.0 - - [20/Dec/2011:20:00:01 +0400] "GET /forum/viewtopic.php?t=425550 HTTP/1.1" 502 107 "-" "BTWebClient/3000(25824)"

…gets converted into a feature vector:

[False, False, False, False, True, False, False, True, True, False, False, False, False, False, False, False, False, True, True, False, False, False, False]
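
The conversion itself is just a membership test of every dictionary entry against the tokens extracted from that log line, roughly (reusing the hypothetical extract_features helper from above):

    # Sketch: turn one log entry into a boolean vector ordered by the dictionary.
    def to_vector(dictionary, tokens):
        return [feature in tokens for feature in dictionary]

    # The dictionary itself is just the union of all tokens seen in the two
    # training logs, e.g.:
    #   dictionary = sorted(set().union(*(extract_features(*e) for e in train_entries)))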

After all of this comes the standard path: split the dataset into training and test sets, train the neural networks, and select the best one. Once that process is done (it can take quite a long time, depending on dataset size), we can finally classify logs with the trained network.
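
With pyBrain that pipeline could look roughly like the sketch below. It assumes vectors and labels already hold the boolean feature vectors and 0/1 (good/bad) labels, and it follows the standard pyBrain classification tutorial rather than the actual script:

    # Sketch of the split / train / evaluate / classify steps with pyBrain.
    from pybrain.datasets import ClassificationDataSet
    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.supervised.trainers import BackpropTrainer
    from pybrain.structure.modules import SoftmaxLayer
    from pybrain.utilities import percentError

    def train(vectors, labels, hidden=10, epochs=20):
        ds = ClassificationDataSet(len(vectors[0]), nb_classes=2)
        for vec, label in zip(vectors, labels):          # label: 0 = good, 1 = bad
            ds.addSample(vec, [label])

        test_ds, train_ds = ds.splitWithProportion(0.25)
        train_ds._convertToOneOfMany()                   # one-hot targets for the softmax output
        test_ds._convertToOneOfMany()

        net = buildNetwork(train_ds.indim, hidden, train_ds.outdim, outclass=SoftmaxLayer)
        trainer = BackpropTrainer(net, dataset=train_ds)
        trainer.trainEpochs(epochs)

        test_error = percentError(trainer.testOnClassData(dataset=test_ds), test_ds['class'])
        return net, test_error

    # Classifying a new entry with the trained network:
    #   is_bot = net.activate(vec).argmax() == 1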

But there are a number of issues with that approach:

  1. Supervised machine learning is kinda wrong for this type of problem, because to detect bots I first need to detect bots, just to have data to train the neural network with.
  2. I don't take the client's behavior into account. It would be better to consider the graph of page-to-page transitions for each user (see the sketch after this list).
  3. I don't take the clients' network locality into account. If one computer in a network is infected with some virus, then there is a higher chance that other computers in that network are infected too.
  4. I don't take geolocation data into account. If you are running a site in Russia, for example, there is little chance of legitimate clients from Brazil.
  5. I don't know whether a neural network and classification were the right way to solve this problem. Maybe I would have been better off with some anomaly detection system.
  6. It's better when the ML method is "online" (or so-called "streaming"), so it can be trained on the fly.
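
For example, the behavioral idea from point 2 could start as simply as building a per-IP page-transition graph from the log and comparing it against what historical (mostly clean) traffic looks like. A rough sketch, not something the script does:

    # Sketch: per-IP page-transition counts from (ip, path) pairs in logged order.
    from collections import defaultdict

    def transition_graphs(requests):
        """requests: iterable of (ip, path) tuples in the order they were logged.
        Returns {ip: {(from_path, to_path): hits}}."""
        last_path = {}                                   # ip -> previously requested path
        graphs = defaultdict(lambda: defaultdict(int))
        for ip, path in requests:
            if ip in last_path:
                graphs[ip][(last_path[ip], path)] += 1
            last_path[ip] = path
        return graphs

    # A bot hammering a single URL produces a graph dominated by one self-loop,
    # which looks nothing like a human browsing session.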

So here are the questions:
What would you do if you were faced with the same problem of defending against a DDoS attack, given only the current webserver logs (containing both good clients and bots) and historical data (logs for the previous day/week/month with mostly good clients)?
Which machine learning approach would you choose?
Which algorithms would you use?

Best Answer

How about anomaly detection algorithms? Since you mention Andrew Ng's class, you have probably seen the "XV. ANOMALY DETECTION" section on ml-class.org, but anyway:

Anomaly detection will be superior to supervised classification in scenarios similar to yours because:

  • normally you have very few anomalies (i.e., too few "positive" examples)
  • normally you have very different types of anomalies
  • future anomalies may look nothing like the ones you've had so far

An important point in anomaly detection is which features to choose. Two common pieces of advice here (sketched below) are to choose features that:

  • follow a Gaussian distribution (or can be distorted to look like one)

  • have p(anomaly) incomparable to p(normal): say, anomalous values are very large while normal ones are very small (or vice versa).
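
A minimal sketch of that ml-class-style density estimation, assuming each row of X is a per-client feature vector built from "normal" historical traffic (the features and the epsilon threshold are up to you):

    # Sketch: independent per-feature Gaussian density estimation, ml-class style.
    import math

    def fit(X):
        """X: list of feature vectors from normal traffic; returns per-feature (mu, var)."""
        n = float(len(X))
        mus = [sum(col) / n for col in zip(*X)]
        variances = [sum((v - mu) ** 2 for v in col) / n
                     for col, mu in zip(zip(*X), mus)]
        return mus, variances

    def p(x, mus, variances):
        """Estimated density p(x) as a product of per-feature Gaussians (assumes var > 0)."""
        prob = 1.0
        for v, mu, var in zip(x, mus, variances):
            prob *= math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        return prob

    # Flag a client as anomalous when p(x) < epsilon, with epsilon tuned on a small
    # labelled validation set (or simply eyeballed against historical traffic).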

I'm not sure whether geolocation would help in your scenario, but client behavior would definitely matter, although it will probably differ from application to application. You may find that the ratio of GETs to POSTs matters, or the ratio of response size to request count, or the number of single-page hits. If you have such info in your logs, you can definitely use the data for retrospective analysis, followed by IP blacklisting :)
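
For that retrospective analysis, such per-client features can be computed straight from the access log and fed into the detector sketched above; here is a rough example, assuming entries yields (ip, method, bytes_sent) tuples:

    # Sketch: per-IP behavioural features (GET/POST ratio, bytes per request).
    from collections import defaultdict

    def per_ip_features(entries):
        """entries: iterable of (ip, method, bytes_sent) tuples from the access log."""
        stats = defaultdict(lambda: {'get': 0, 'post': 0, 'bytes': 0, 'reqs': 0})
        for ip, method, bytes_sent in entries:
            s = stats[ip]
            if method == 'GET':
                s['get'] += 1
            elif method == 'POST':
                s['post'] += 1
            s['bytes'] += bytes_sent
            s['reqs'] += 1
        return {ip: [s['get'] / float(s['post'] + 1),    # GET/POST ratio (+1 avoids division by zero)
                     s['bytes'] / float(s['reqs'])]      # average response size
                for ip, s in stats.items()}

    # Fit the Gaussian model on yesterday's (mostly clean) log, score today's IPs,
    # and push the low-probability ones into an nginx deny list or the firewall.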
