Solved – Training a Tic Tac Toe brain – am I on the right track

games, machine learning, python, scikit-learn

My only experience with Machine Learning is Andrew Ng's Coursera course, but I did work through that just fine and passed with 100%.

I decided to practice by making up some problems and solving them. My idea was to create a simple tic tac toe game, and have it just play millions of random games, and use the data to learn to predict the next move based on sequences that won in the past.

Is it reasonable to expect to learn to predict a good move from 1 million samples of games that player X won?

I have a function getWinSequences('X', 1000000) that returns a list of sequences that win for X. A sequence represents a series of choices, 1 to 9, on the game board.

The board would look like this:

1|2|3
-----
4|5|6
-----
7|8|9

So winsX looks like this:

[
    [4,2,5,3,6],
    [1,4,2,8,3],
    [1,2,4,3,5,8,9]
]

So the first sequence means that player X chose 4, O chose 2, X chose 5, O chose 3, and X chose 6.

So X won with 3 in a row: 4, 5, 6 ([4,2,5,3,6]).
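
A minimal sketch of how a function like getWinSequences might be implemented: play uniformly random games and keep the move sequences of the games that X wins. This is illustrative only; the helper names (winner, WIN_LINES) are placeholders, not the actual code.

import random

WIN_LINES = [(1,2,3),(4,5,6),(7,8,9),(1,4,7),(2,5,8),(3,6,9),(1,5,9),(3,5,7)]

def winner(cells_by_player):
    for player, cells in cells_by_player.items():
        if any(all(c in cells for c in line) for line in WIN_LINES):
            return player
    return None

def getWinSequences(player, n):
    wins = []
    while len(wins) < n:
        order = list(range(1, 10))
        random.shuffle(order)                      # a random game is just a random cell order
        seq, cells = [], {'X': set(), 'O': set()}
        turn = 'X'
        for cell in order:
            seq.append(cell)
            cells[turn].add(cell)
            w = winner(cells)
            if w is not None:
                if w == player:                    # keep only sequences where X wins
                    wins.append(seq)
                break
            turn = 'O' if turn == 'X' else 'X'
    return wins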

Now, I'm not doing machine learning on the winning sequences themselves, because I want the program to take a "current board position", not a sequence. In other words, given a board with no or some X's and O's, and assuming it's player X's turn, predict a good move (1-9) based on past observations of boards and knowledge of the next move that was chosen.

So I have a function that converts a winning sequence into a series of board positions, each paired with a label saying which move comes next.

So the actual data that I train on looks like this:

X[0] = [0,2,2,
        1,1,0,
        0,0,0]
y[0] = 6

0 means empty, 1 means an X is there, 2 means an O is there. So based on my winning sequences, I am able to automatically label that position as "6", because the board is a recreation of what it looked like just before we "randomly" won by choosing 6.

Because I want to be able to predict a good move even when there is no immediately winning move, I don't just create the board as it was just before the win: for every step along a winning sequence, I create a board and label it with the next move from that sequence.

So a winning sequence will usually result in 3 to 5 examples of a board position, and a label (1-9) saying what the move for that board should be.
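
A rough sketch of what samplesFromWins might look like under this encoding (0 = empty, 1 = X, 2 = O); the actual implementation is an assumption. Each sample is 10 numbers: the 9 board cells followed by the next X move as the label.

def samplesFromWins(wins):
    samples = []
    for seq in wins:
        board = [0] * 9
        for i, cell in enumerate(seq):
            if i % 2 == 0:                         # X to move: record (board, move)
                samples.append(board[:] + [cell])
                board[cell - 1] = 1                # then X occupies the cell
            else:
                board[cell - 1] = 2                # O occupies the cell
    return samples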

I figured that with a million examples of winning sequences, and the board positions that led to the win, I would be able to predict the next move. Is that reasonable? If so, what is a good way to train on the data? When I trained on 1 million examples, I got a score of 0.29 from GradientBoostingClassifier.

The relevant code looks like this:

from sklearn.ensemble import GradientBoostingClassifier

# generate winning sequences for X and expand them into (board, next-move) samples
winsX = getWinSequences('X', 1000000)
samplesX = samplesFromWins(winsX)

# the first 9 columns are the board cells, the 10th is the move label
X = [row[:9] for row in samplesX]
y = [row[9] for row in samplesX]

X_train, X_test = X[:700000], X[700000:]
y_train, y_test = y[:700000], y[700000:]

print("got data, ready to train")

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train, y_train)

print(clf.score(X_test, y_test))

I just want to know if I'm on the right track, or if there is a fundamental flaw in what I'm doing.

Best Answer

The main flaw in this method is the following. You want to predict a move that leads to winning, but instead the model tries to guess the one particular move from the database that may lead to winning. To fix that, you should at least estimate your model's performance not by whether it predicts the "correct" move for a test example, but by the rate of wins it actually achieves. In most cases there is no single correct move, since multiple moves can lead to winning. For example, for your object

X[0] = [0,2,2,
        1,1,0,
        0,0,0]

the most reasonable prediction is 6, although the training set probably also contains 9, since it may eventually lead to winning as well. However, the move 9 is worse than 6: a rational opponent would win in that case. The model nevertheless treats those moves as equal, since the training set lacks any losing strategies.

If you still want to treat it as a supervised learning problem, you can instead model the distribution $P(\textit{result} \mid \textit{move}, \textit{board})$ where $\textit{result} \in \{\textit{win} , \textit{draw}, \textit{lose} \}$. To choose the move, just maximize the probability of winning, or minimize the probability of losing. More generally,

$m = \arg\!\min_{move} ~f\Big(P(\textit{result} = \textit{win} \mid \textit{move}, \textit{board}), P(\textit{result} = \textit{lose} \mid \textit{move}, \textit{board}) \Big)$,

where $f$ is some loss function. You can estimate those conditional probabilities from simulations, similarly to what you did when generating the winning strategies.
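
A minimal sketch of that idea: estimate the probabilities by random rollouts from the given board and score each move by P(win) − P(lose), which corresponds to one particular choice of $f$. Helper names (rollout_result, best_move) are illustrative; the board encoding follows the question (0 empty, 1 = X, 2 = O, moves 1-9).

import random
from collections import defaultdict

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def board_winner(b):
    for line in WIN_LINES:
        v = b[line[0]]
        if v != 0 and all(b[i] == v for i in line):
            return v
    return None

def rollout_result(board, first_move):
    """Play first_move for X, then finish the game with uniformly random moves;
    return 'win', 'lose' or 'draw' from X's point of view."""
    b = board[:]
    b[first_move - 1] = 1
    turn = 2                                       # O replies next
    while True:
        w = board_winner(b)
        if w == 1:
            return 'win'
        if w == 2:
            return 'lose'
        free = [i + 1 for i, v in enumerate(b) if v == 0]
        if not free:
            return 'draw'
        b[random.choice(free) - 1] = turn
        turn = 1 if turn == 2 else 2

def best_move(board, n_sims=1000):
    scores = {}
    for m in (i + 1 for i, v in enumerate(board) if v == 0):
        counts = defaultdict(int)
        for _ in range(n_sims):
            counts[rollout_result(board, m)] += 1
        scores[m] = (counts['win'] - counts['lose']) / n_sims
    return max(scores, key=scores.get)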

However, this model has a problem: it treats your opponent's moves (and your own further moves) as being drawn at random. Game theory advises choosing your strategy assuming that your opponent is rational, i.e. you choose your move to maximize your expected payoff, assuming your opponent then chooses hers to minimize it. For tic-tac-toe, the best strategy in this sense can be found deterministically by traversing the tree of board configurations, without any learning from data.
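
A compact sketch of that deterministic approach: plain minimax over board states, with X maximizing and O minimizing, using the same board encoding as above (0 empty, 1 = X, 2 = O); no training data is needed.

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def board_winner(b):
    for line in WIN_LINES:
        v = b[line[0]]
        if v != 0 and all(b[i] == v for i in line):
            return v
    return None

def minimax(b, turn):
    """Return (value, move): value is +1 if X can force a win, -1 if O can,
    0 for a forced draw; the move is reported as 1-9."""
    w = board_winner(b)
    if w == 1:
        return 1, None
    if w == 2:
        return -1, None
    free = [i for i, v in enumerate(b) if v == 0]
    if not free:
        return 0, None
    best = None
    for i in free:
        b[i] = turn
        val, _ = minimax(b, 1 if turn == 2 else 2)
        b[i] = 0
        if best is None or (turn == 1 and val > best[0]) or (turn == 2 and val < best[0]):
            best = (val, i + 1)
    return best

# e.g. minimax([0]*9, 1) evaluates to a draw, since perfect play from an empty board cannot be beaten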

If you want to account for long-term strategies in games like this, you should use techniques from reinforcement learning, which is, roughly speaking, learning not from class labels but from simulations. Q-learning is one such technique. Here is a lab assignment that asks you to implement Q-learning exactly for tic-tac-toe; it cites this chapter on the theory of reinforcement learning.
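
A bare-bones sketch of tabular Q-learning for the X player against a random opponent. The hyperparameters, reward scheme and helper names are illustrative choices, not taken from the assignment; cells are indexed 0-8 here.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
Q = defaultdict(float)                             # Q[(board_tuple, action)]

def board_winner(b):
    for line in WIN_LINES:
        v = b[line[0]]
        if v != 0 and all(b[i] == v for i in line):
            return v
    return None

def free_cells(b):
    return [i for i, v in enumerate(b) if v == 0]

def choose_move(b):
    acts = free_cells(b)
    if random.random() < EPSILON:                  # epsilon-greedy exploration
        return random.choice(acts)
    return max(acts, key=lambda a: Q[(tuple(b), a)])

def play_episode():
    b = [0] * 9
    while True:
        s, a = tuple(b), choose_move(b)
        b[a] = 1                                   # X plays
        if board_winner(b) == 1:
            reward, done = 1.0, True
        elif not free_cells(b):
            reward, done = 0.0, True
        else:
            b[random.choice(free_cells(b))] = 2    # random O reply
            if board_winner(b) == 2:
                reward, done = -1.0, True
            elif not free_cells(b):
                reward, done = 0.0, True
            else:
                reward, done = 0.0, False
        # standard Q-learning update on the (state, action) pair just played
        future = 0.0 if done else max(Q[(tuple(b), a2)] for a2 in free_cells(b))
        Q[(s, a)] += ALPHA * (reward + GAMMA * future - Q[(s, a)])
        if done:
            return

for _ in range(50000):                             # train over many simulated episodes
    play_episode()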
