Solved – Simple machine learning: bot detection

machine learning, python

I've been aching to get my feet wet with a machine learning project, and I've found one that should be relatively simple and actually has non-negligible business value for my organization. The marketing guys currently remove bot activity from our tracking data by hand for their metrics. I want to pull some data from GA and have them construct a labeled data set (bot, not-a-bot). There are probably 5-10 (numerical) categories that we can use to train the algorithm, and the data set can be made as big as the marketing guys have an appetite for.

I've done a bit of reading and played a little with RapidMiner/KNIME/Weka. I plan to do everything in Python with scikit-learn, possibly working in R where I have to. My questions:

  1. Is this a "not actually that easy at all" problem?
  2. Given the number of categories, about how large should the training set be?
  3. Given the problem, what algorithms should I start with?
  4. Has anyone else done any learning around bot detection? How did it work? Am I barking up the wrong tree?

Thanks in advance, community!

Best Answer

In my opinion:

  1. No! This can be done. The difficulty will be determined by how accurate the algorithm needs to be. If you need something very close to 100% accuracy (meaning that the algorithm will identify a bot 100 times out of 100), then it can become quite difficult.

  2. The size of the data set doesn't depend on the number of categories; it depends on the variation within those categories. You can start with a relatively small data set (around 200 samples) and see how the algorithm does. If the algorithm is not accurate enough, you can try training it on a larger data set.

  3. I recommend an evolutionary algorithm for choosing the features, with the classification accuracy of a k-Nearest Neighbors (k-NN) classifier as the fitness function.

  4. I haven't done anything with bot detection. However, this problem can be classified as a "feature selection and classification" problem. This is because you design the algorithm to pick a set of features (categories in your case) that can classify a hitherto unseen sample as either "bot" or "non-bot". Evolutionary algorithms are extremely good at feature selection and classification and are relatively easy to implement. I know they can be implemented in Python without too many problems.
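
To make the suggestion in point 3 concrete, here is a minimal sketch of one way it might look with scikit-learn. It is an illustration under assumptions, not a tested bot detector: the data is synthetic (`make_classification` stands in for the labeled export the marketing team would produce), the function names `fitness` and `evolve_feature_mask` are made up for this example, and the population size, number of generations, and mutation rate are arbitrary starting values. Each candidate solution is a boolean mask over the features, and its fitness is the cross-validated accuracy of k-NN on the selected columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fitness(mask, X, y):
    """Mean 5-fold cross-validated accuracy of k-NN on the masked features."""
    if not mask.any():  # an empty feature set cannot classify anything
        return 0.0
    # k-NN is distance based, so the features are standardized first.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    return cross_val_score(model, X[:, mask], y, cv=5).mean()


def evolve_feature_mask(X, y, pop_size=12, generations=8,
                        mutation_rate=0.1, seed=0):
    """Very small genetic algorithm over boolean feature masks (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    population = rng.random((pop_size, n_features)) < 0.5
    best_mask, best_score = None, -1.0
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in population])
        top = int(scores.argmax())
        if scores[top] > best_score:
            best_mask, best_score = population[top].copy(), float(scores[top])
        # Selection: the fitter half of the population become parents.
        parents = population[np.argsort(scores)[::-1][: pop_size // 2]]
        children = []
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            # Uniform crossover: each gene is taken from one of the two parents.
            child = np.where(rng.random(n_features) < 0.5, a, b)
            # Mutation: flip each gene with a small probability.
            children.append(child ^ (rng.random(n_features) < mutation_rate))
        population = np.array(children)
    return best_mask, best_score


# Synthetic stand-in for the labeled data: 200 samples, 8 numerical features,
# only 3 of which carry signal. Labels: 1 = bot, 0 = not-a-bot (assumed encoding).
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
mask, score = evolve_feature_mask(X, y)
print("selected feature indices:", np.flatnonzero(mask))
print("cross-validated accuracy: %.3f" % score)
```

Because the same cross-validation score is used both to pick the features and to report the result, the printed accuracy is optimistic. On real data it would be worth holding out a separate test set that the search never sees, and comparing against plain k-NN on all features to check that the search actually helps.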