Solved – Training a neural network on chess data

Tags: adam, large data, machine learning, neural networks, python

I have been writing a chess engine with a friend and the engine itself is really good already (2700+ CCRL). We had the idea to use a neural network to have a better evaluation of positions.

Input to the network

Because the output of the network depends heavily on which side is to move, we use the first half of the inputs to encode the pieces of the side to move and the second half for the opponent. A plain encoding would have one input for each piece and each square, giving 12×64 inputs. We had the idea to also include the opponent king's position: each side has 6×64 inputs, repeated for each of the 64 squares the opponent king can be on, giving 6×64×64 = 24576 inputs per side. In total, this results in 12×64×64 = 49152 binary input values, of which at most 32 are set.
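A minimal sketch of how such an input index could be computed. The question does not state the ordering of the three dimensions, so the layout here (opponent king square, then piece type, then piece square) is an assumption, and the names `feature_index` and `active_indices` are purely illustrative.

```python
# Hypothetical index layout for one half of the input (6 x 64 x 64 features).
# Assumption: squares are numbered 0..63 and piece types 0..5; the ordering
# of the three dimensions is an illustrative choice, not the author's.
NUM_PIECE_TYPES = 6
NUM_SQUARES = 64
HALF_SIZE = NUM_PIECE_TYPES * NUM_SQUARES * NUM_SQUARES  # 24576 per side


def feature_index(opp_king_sq, piece_type, piece_sq):
    """Return the index of the single active input for one piece."""
    return (opp_king_sq * NUM_PIECE_TYPES + piece_type) * NUM_SQUARES + piece_sq


def active_indices(pieces, opp_king_sq):
    """Map a list of (piece_type, square) pairs to sorted active input indices."""
    return sorted(feature_index(opp_king_sq, pt, sq) for pt, sq in pieces)


# Example: a king (type 5) on square 4 and a pawn (type 0) on square 12,
# with the opponent king on square 60.
example = active_indices([(5, 4), (0, 12)], opp_king_sq=60)
```

Storing a position as a short list of active indices like this, rather than as a 24576-wide vector, keeps the data set small even for hundreds of millions of positions.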

Layers

The next layer consists of 64 neurons, where the first 32 neurons only accept inputs from the first half of the input features and the last 32 only accept inputs from the second half.
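Because at most 16 inputs per half are non-zero, the pre-activation of this layer for one half is just the bias plus the sum of a few weight rows. The sketch below checks that against the full dense product. The weights are random placeholders and the variable names are illustrative; only the sizes (24576 inputs, 32 neurons per half) come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 6 * 64 * 64, 32   # one half of the input, 32 neurons

# Placeholder weights; a real network would learn these.
W = rng.standard_normal((n_inputs, n_hidden)).astype(np.float32)
b = rng.standard_normal(n_hidden).astype(np.float32)

# Indices of the active (value 1) inputs for one side, at most 16 of them.
active = rng.choice(n_inputs, size=16, replace=False)

# Dense evaluation: build the full binary vector and multiply.
x = np.zeros(n_inputs, dtype=np.float32)
x[active] = 1.0
dense_out = x @ W + b

# Sparse evaluation: sum only the weight rows of the active inputs.
sparse_out = W[active].sum(axis=0) + b
```

The sparse form touches 16 rows instead of 24576, which is the reason a sparse input representation can matter so much for training and inference speed here.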

This is followed by a fully connected layer with 32 neurons, and the output layer has a single output.
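As a quick sanity check on the size of this network, the parameter count implied by the description can be worked out directly. This sketch follows the prose (32 first-layer neurons per half) and assumes every layer has a bias term, which the question does not state.

```python
# Layer sizes taken from the prose description; biases are an assumption.
half_inputs = 6 * 64 * 64      # 24576 inputs per side
half_hidden = 32               # first-layer neurons per side
second_hidden = 32             # neurons in the fully connected layer
outputs = 1

first_layer = 2 * (half_inputs * half_hidden + half_hidden)
second_layer = (2 * half_hidden) * second_hidden + second_hidden
output_layer = second_hidden * outputs + outputs
total_params = first_layer + second_layer + output_layer

print(first_layer, second_layer, output_layer, total_params)
# prints: 1572928 2080 33 1575041
```

Almost all of the roughly 1.6 million parameters sit in the first layer, so that is where the cost of a dense forward and backward pass is concentrated.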

Activation function

We use LeakyReLU at both hidden layers and a linear activation function at the output.

Training

Initially, I wanted to train the network on about 1 million positions, yet this is taking ages. Each position has a target value in the range of -20 to 20. I am using stochastic gradient descent with Adam, a learning rate of 0.0001, and MSE as the loss function.
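For reference, a minimal sketch of a mini-batch Adam update with an MSE loss, written in NumPy on a toy linear model. The toy data, the batch size, and the larger learning rate of 0.05 used in the loop are assumptions chosen so the example converges in a few hundred steps; they are not the settings from the question. The `adam_step` helper is illustrative, with the usual default hyperparameters.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))      # toy inputs
true_w = rng.standard_normal(8)
y = X @ true_w                         # toy targets

w, m, v = np.zeros(8), np.zeros(8), np.zeros(8)
losses = []
for t in range(1, 201):
    batch = rng.choice(256, size=32, replace=False)
    err = X[batch] @ w - y[batch]
    losses.append(float(np.mean(err ** 2)))      # MSE on this mini-batch
    grad = 2 * X[batch].T @ err / len(batch)     # gradient of the MSE
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
```

Each update only needs the gradient of one mini-batch, so the cost per step is governed by the batch size and the cost of the forward and backward pass, not by the size of the full data set.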

The problem is that it takes a very long time to train on even those 1 million positions. The goal is to later train on 300M positions.

I am not sure where I could improve the training progress.

Below are the graphs which show the training progress over 1000 iterations

The change for each iteration looks like this:

I hope someone can give me a hint or two on what I could change to train the network faster. I would be grateful for any advice!

Greetings,
Finn

Edit 1

As suggested, I am converting my network to Keras, but I am having problems getting the sparse input to run.

import keras
from keras.layers import Input, Concatenate, Dense, LeakyReLU
from keras.models import Model
from keras import backend as K
import numpy as np

# Earlier attempt with sparse tensors (would also need `import tensorflow as tf`):
# trainX1 = tf.SparseTensor(indices=[[0,0], [0,1]], values=[1, 2], dense_shape=[1,24576])
# trainX2 = tf.SparseTensor(indices=[[0,0], [0,1]], values=[1, 2], dense_shape=[1,24576])
#
# trainY = np.random.rand(1)


trainX1 = np.random.random((10000,24576))
trainX2 = np.random.random((10000,24576))

trainY = np.zeros((10000,1))



#input for player to move
activeInput = Input((64*64*6,))
inactiveInput = Input((64*64*6,))


denseActive = Dense(64)(activeInput)
denseInactive = Dense(64)(inactiveInput)


act1 = LeakyReLU(alpha=0.1)(denseActive)
act2 = LeakyReLU(alpha=0.1)(denseInactive)

concat_layer= Concatenate()([act1, act2])
dense1 = Dense(32)(concat_layer)

act3 = LeakyReLU(alpha=0.1)(dense1)

output = Dense(1, activation="linear")(act3)

model = Model(inputs=[activeInput, inactiveInput], outputs=output)
# 'accuracy' is a classification metric and is not meaningful for this regression target
model.compile(loss='mse', optimizer='adam', metrics=['mae'])

# print(model.summary())

print(model.fit([trainX1,trainX2], trainY, epochs=1))

If I use sparse=True for the Dense layer, it throws exceptions. I would be grateful if someone could help me create sparse input vectors.
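One possible workaround that sidesteps sparse tensors entirely is to store each position as a list of active indices and build small dense batches on the fly. This is only a sketch under that assumption; the helper names `densify` and `batches` and the tiny data set are made up for illustration.

```python
import numpy as np

HALF_SIZE = 64 * 64 * 6  # inputs per side, as in the model above


def densify(index_lists, size=HALF_SIZE):
    """Turn a list of active-index lists into a dense 0/1 float32 batch."""
    batch = np.zeros((len(index_lists), size), dtype=np.float32)
    for row, indices in enumerate(index_lists):
        batch[row, indices] = 1.0
    return batch


def batches(active_idx, inactive_idx, targets, batch_size):
    """Yield ([x_active, x_inactive], y) batches built on the fly."""
    for start in range(0, len(targets), batch_size):
        stop = start + batch_size
        y = np.asarray(targets[start:stop], dtype=np.float32).reshape(-1, 1)
        yield [densify(active_idx[start:stop]),
               densify(inactive_idx[start:stop])], y


# Tiny made-up data set: three positions with a few active features each.
active_idx = [[0, 5], [7], [1, 2, 3]]
inactive_idx = [[10], [11, 12], [13]]
targets = [0.5, -1.0, 2.0]

first_inputs, first_y = next(batches(active_idx, inactive_idx, targets, batch_size=2))
```

Each yielded pair could then be fed to a per-batch training call such as `model.train_on_batch`. Only one batch is ever held in dense form, so memory stays bounded, though whether this is fast enough for 300M positions would need measuring.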

Best Answer

I think you need to consider running it on a GPU. Google Colab is free and Amazon AWS is very cheap. You seem to know what you are doing, so you can probably get up and running with PyTorch very quickly. Once you compare the performance of the same network implemented on a GPU with your single-processor setup, you will be in a better position to know where to go next.

Relation to Word2Vec

==========================================

Word2Vec in a simple picture:

[Image: word2vec pic]

More in-depth explanation:

I believe it's related to the recent Word2Vec innovation in natural language processing. Roughly, Word2Vec means our vocabulary is discrete and we will learn a map which embeds each word into a continuous vector space. Using this vector space representation allows us to have a continuous, distributed representation of our vocabulary words. If, for example, our dataset consists of n-grams, we may now use our continuous word features to create a distributed representation of our n-grams. In the process of training a language model we will learn this word embedding map. The hope is that by using a continuous representation, our embedding will map similar words to similar regions. For example, in the landmark paper Distributed Representations of Words and Phrases and their Compositionality, observe in Tables 6 and 7 that certain phrases have very good nearest neighbour phrases from a semantic point of view. Transforming into this continuous space allows us to use continuous metric notions of similarity to evaluate the semantic quality of our embedding.

Explanation using Lasagne code

Let's break down the Lasagne code snippet:

x = T.imatrix()

x is a matrix of integers. Okay, no problem. Each word in the vocabulary can be represented as an integer, or as a one-hot sparse encoding. So if x is 2x2, we have two datapoints, each being a 2-gram.

l_in = InputLayer((3, ))

The input layer. The 3 represents the size of our vocabulary. So we have words $w_0, w_1, w_2$ for example.

W = np.arange(3*5).reshape((3, 5)).astype('float32')

This is our word embedding matrix. It is a 3 row by 5 column matrix with entries 0 to 14.

Up until now we have the following interpretation. Our vocabulary has 3 words and we will embed our words into a 5-dimensional vector space. For example, we may represent one word as $w_0 = (1, 0, 0)$, another word as $w_1 = (0, 1, 0)$, and the remaining word as $w_2 = (0, 0, 1)$, i.e. as one-hot sparse encodings. We can view the $W$ matrix as embedding these words via matrix multiplication. Therefore the first word $w_0 \rightarrow w_0W = [0, 1, 2, 3, 4]$. Similarly $w_1 \rightarrow w_1W = [5, 6, 7, 8, 9]$.

Note that, because of the one-hot sparse encoding, this multiplication just selects a row of $W$, which is why you will also see it referred to as a table lookup.
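The equivalence between the one-hot multiplication and a row lookup can be checked in plain NumPy. This is a small illustration using the same `W` as above, not part of the Lasagne snippet; the names `via_matmul` and `via_lookup` are illustrative.

```python
import numpy as np

W = np.arange(3 * 5).reshape((3, 5)).astype('float32')  # same W as above

one_hot = np.eye(3, dtype='float32')   # rows are w_0, w_1, w_2
via_matmul = one_hot[1] @ W            # w_1 W, the matrix-product view
via_lookup = W[1]                      # row 1 of the table, the lookup view
```

Both give `[5, 6, 7, 8, 9]`, but the lookup never materialises the one-hot vector, which is what makes embeddings cheap for large vocabularies.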

l1 = EmbeddingLayer(l_in, input_size=3, output_size=5, W=W)

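The embedding layer, with input_size=3 distinct embeddings (one per vocabulary word) and output_size=5 dimensions per embedding.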

output = get_output(l1, x)

Symbolic Theano expression for the embedding.

f = theano.function([x], output)

Theano function which computes the embedding.

x_test = np.array([[0, 2], [1, 2]]).astype('int32')

It's worth pausing here to discuss what exactly x_test means. First notice that all of x_test entries are in {0, 1, 2}, i.e. range(3). x_test has 2 datapoints. The first datapoint [0, 2] represents the 2-gram $(w_0, w_2)$ and the second datapoint represents the 2-gram $(w_1, w_2)$.

We now wish to embed our 2-grams using our word embedding layer. Before we do that, let's make sure we're clear about what should be returned by our embedding function f. The 2-gram $(w_0, w_2)$ is equivalent to the matrix [[1, 0, 0], [0, 0, 1]]. Applying our embedding matrix W to this sparse matrix should yield [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]. Note that in order for the matrix multiplication to work out, we have to apply the word embedding matrix $W$ via right multiplication to the sparse matrix representation of our 2-gram.

f(x_test)

returns:

          array([[[  0.,   1.,   2.,   3.,   4.],
                  [ 10.,  11.,  12.,  13.,  14.]],
                 [[  5.,   6.,   7.,   8.,   9.],
                  [ 10.,  11.,  12.,  13.,  14.]]], dtype=float32)

To convince you that the 3 does indeed represent the vocabulary size, try inputting a matrix x_test = [[5, 0], [1, 2]]. You will see that it raises a matrix mis-match error.
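NumPy integer indexing performs the same kind of table lookup, which gives a way to check the result without Theano installed. This is only an analogy for what the embedding layer computes, not Lasagne's implementation, and the `IndexError` NumPy raises for the out-of-range index may differ from the exact error Theano reports.

```python
import numpy as np

W = np.arange(3 * 5).reshape((3, 5)).astype('float32')
x_test = np.array([[0, 2], [1, 2]]).astype('int32')

# Shape (2, 2, 5): 2 datapoints, 2 words each, 5 embedding dimensions.
embedded = W[x_test]

# An index outside range(3) has no row in W, so the lookup fails.
x_bad = np.array([[5, 0], [1, 2]]).astype('int32')
try:
    W[x_bad]
    lookup_failed = False
except IndexError:
    lookup_failed = True
```

Here `embedded` matches the array returned by `f(x_test)` above, and `lookup_failed` ends up `True` because word index 5 is outside the 3-row table.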