Solved – Akinator.com and Naive Bayes classifier

machine-learning, naive-bayes

Context: I am a programmer with some (half-forgotten) experience in statistics from uni courses. Recently I stumbled upon http://akinator.com and spent some time trying to make it fail. And who hasn't? 🙂

I decided to find out how it could work. After googling, reading related blog posts, and adding some of my (limited) knowledge into the resulting mix, I came up with the following model (I'm sure I'll use the wrong notation, please don't kill me for that):

There are Subjects (S) and Questions (Q). The goal of the predictor is to select the subject S with the greatest a posteriori probability of being the subject the user is thinking about, given the questions and answers collected so far.

Let game G be the set of questions asked and answers given so far: $\{q_1, a_1\}, \{q_2, a_2\}, \dots, \{q_n, a_n\}$.

Then the predictor is looking for $P(S|G) = \frac{P(G|S)\,P(S)}{P(G)}$.

The prior for subjects, $P(S)$, could simply be the number of times a subject has been guessed divided by the total number of games.

Making the (naive) assumption that all answers are independent given the subject, we can compute the likelihood of subject S given the game G like so:

$P(G|S) = \prod_{i=1}^{n} P(\{q_i, a_i\} | S)$

We can estimate $P(\{q_i, a_i\} | S)$ if we keep track of which questions and answers were given in games where the user had thought of the given subject:

$P(\{q, a\} | S) = \frac{\text{number of times answer } a \text{ was given to question } q \text{ in games where } S \text{ was the subject}}{\text{number of times } q \text{ was asked in games involving } S}$
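The counting model above translates directly into code. Here is a minimal sketch in Python; the dictionary layout, function name, and the small smoothing constant used for never-asked questions are my assumptions, not part of the original model:

```python
def posterior(games_won, total_games, qa_counts, q_counts, game):
    """Naive-Bayes posterior P(S | G) over subjects.

    games_won[s]       -- number of games in which s was the subject (gives the prior)
    qa_counts[s][q][a] -- times answer a was given to q when s was the subject
    q_counts[s][q]     -- times q was asked in games involving s
    game               -- list of (question, answer) pairs asked so far
    """
    scores = {}
    for s in games_won:
        p = games_won[s] / total_games                  # prior P(S)
        for q, a in game:
            asked = q_counts[s].get(q, 0)
            if asked == 0:
                p *= 1e-3                               # never asked for s: small constant (assumption)
            else:
                p *= qa_counts[s][q].get(a, 0) / asked  # P({q, a} | S)
        scores[s] = p
    z = sum(scores.values()) or 1.0
    return {s: p / z for s, p in scores.items()}        # normalising makes P(G) drop out
```

Normalising at the end sidesteps computing $P(G)$ explicitly, since it is the same for every subject.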

Now, $P(S|G)$ defines a probability distribution over subjects, and when we need to select the next question we have to pick the one for which the expected reduction in the entropy of this distribution is maximal:

$\operatorname{argmax}_j \left( H[P(S|G)] - \sum_{a = \text{yes}, \text{no}, \text{maybe}, \dots} P(a | G, q_j)\, H[P(S | G \cup \{q_j, a\})] \right)$
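The question-selection step can be sketched as follows. Note one assumption I am making explicit here: for the sum to be a true expectation, each hypothetical answer must be weighted by its probability $P(a | G, q_j)$, obtained by marginalising over subjects. The helper names (`pick_question`, `ans_prob`) are illustrative:

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a {outcome: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def pick_question(post, questions, answers, ans_prob):
    """Pick the question with maximal expected entropy reduction.

    post              -- current posterior P(S | G) as {subject: prob}
    questions         -- candidate questions not asked yet
    answers           -- possible answers, e.g. ("yes", "no")
    ans_prob[s][q][a] -- estimated P({q, a} | S) from the counts above
    """
    h_now = entropy(post)
    best_q, best_gain = None, float("-inf")
    for q in questions:
        expected_h = 0.0
        for a in answers:
            # P(a | G, q) = sum_s P(S | G) * P({q, a} | S)
            p_a = sum(post[s] * ans_prob[s][q][a] for s in post)
            if p_a == 0:
                continue
            # Hypothetical posterior P(S | G with {q, a}) via one Bayes step
            new_post = {s: post[s] * ans_prob[s][q][a] / p_a for s in post}
            expected_h += p_a * entropy(new_post)
        gain = h_now - expected_h
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q
```

This is the step whose cost explodes with the number of subjects, since every candidate question requires a full pass over the posterior for each possible answer.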

I've tried implementing this and it works. But, obviously, as the number of subjects goes up, performance degrades due to the need to recalculate $P(S|G)$ after each move and to compute the updated distributions $P(S | G \cup \{q_j, a\})$ for question selection.

I suspect I have simply chosen the wrong model, being constrained by the limits of my knowledge. Or maybe there is an error in the math. Please enlighten me: what should I become familiar with, or how should I change the predictor so that it can cope with millions of subjects and thousands of questions?

Best Answer

This game looks similar to 20 Questions at http://20q.net, which its creator reports is based on a neural network. Here is one way to structure such a network, similar to the neural network described in Concept description vectors and the 20 question game.

You'd have

  1. A fixed number of questions, with some questions marked as "final" questions.
  2. One input unit per question, where 0/1 represents a no/yes answer. Initially set to 0.5.
  3. One output unit per question, with a sigmoid squashing the output into the 0..1 range.
  4. A hidden layer connecting all input units to all output units.

Input units for questions that have been answered are set to 0 or 1, and the assumption is that the neural network has been trained to make output units produce values close to 1 for questions whose answer would be "yes", given the set of existing answers.
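A forward pass through such a network might look like the following sketch in NumPy. The layer sizes, the tanh hidden activation, and the random initialisation are my assumptions; the answer only specifies the input/output encoding:

```python
import numpy as np

rng = np.random.default_rng(0)

N_QUESTIONS = 6   # illustrative size, not from the answer
N_HIDDEN = 8      # illustrative size, not from the answer

# One input unit and one output unit per question, fully connected
# through a single hidden layer (weights here are untrained).
W1 = rng.normal(scale=0.1, size=(N_QUESTIONS, N_HIDDEN))
W2 = rng.normal(scale=0.1, size=(N_HIDDEN, N_QUESTIONS))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(inputs):
    """inputs: length-N_QUESTIONS array, 0/1 for answered questions, 0.5 for unknown."""
    hidden = np.tanh(inputs @ W1)
    return sigmoid(hidden @ W2)   # predicted "yes"-probability per question

x = np.full(N_QUESTIONS, 0.5)    # nothing answered yet
x[0] = 1.0                       # question 0 answered "yes"
out = forward(x)
```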

At each stage you would pick the question the NN is least sure about, i.e., the one whose output unit is closest to 0.5, ask it, and set the corresponding input unit to the answer. At the last stage you pick the output unit from the "final question" list with the value closest to 1.
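The "least sure" selection rule is a one-liner over the output activations; this helper (the name and signature are mine) assumes the outputs are indexed by question:

```python
def next_question(outputs, answered):
    """Pick the unanswered question the network is least sure about.

    outputs  -- per-question output-unit activations in the 0..1 range
    answered -- set of indices of questions already asked
    """
    candidates = [i for i in range(len(outputs)) if i not in answered]
    # Closest to 0.5 means maximal uncertainty about the answer.
    return min(candidates, key=lambda i: abs(outputs[i] - 0.5))
```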

Each game of 20 questions gives 20 data points which you can use to update the NN's weights with back-propagation, i.e., you update the weights to make the outputs of the current neural network match the true answers, given all the questions asked previously.
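One such weight update can be sketched as follows, for a single (answers-so-far, full true answers) pair. The tiny layer sizes, learning rate, and the cross-entropy gradient for the sigmoid outputs are my assumptions; this is a back-propagation sketch, not the 20q implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_q, n_h, lr = 4, 5, 0.5   # illustrative sizes and learning rate

W1 = rng.normal(scale=0.5, size=(n_q, n_h))
W2 = rng.normal(scale=0.5, size=(n_h, n_q))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, target):
    """One back-propagation update pushing the outputs toward the true answers.

    x      -- inputs (0/1 for answered questions, 0.5 for unasked)
    target -- full true answer vector for the subject of this game
    Returns the mean squared error before the update, for monitoring only.
    """
    global W1, W2
    h = np.tanh(x @ W1)
    y = sigmoid(h @ W2)
    # For cross-entropy loss with sigmoid outputs, dL/dz_out = y - target.
    delta_out = y - target
    grad_W2 = np.outer(h, delta_out)
    delta_hid = (delta_out @ W2.T) * (1 - h**2)   # tanh derivative
    grad_W1 = np.outer(x, delta_hid)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    return np.mean((y - target) ** 2)

# One data point: question 0 answered "yes" so far; full true answers for the subject.
x = np.array([1.0, 0.5, 0.5, 0.5])
t = np.array([1.0, 0.0, 1.0, 1.0])
losses = [train_step(x, t) for _ in range(200)]
```

In a real game you would run one such update per question asked, with `x` reflecting the answers known at that point.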