Solved – Using (leaky) ReLU as an activation function


I am building a convolutional neural network to classify certain types of greyscale images (chess puzzles).

I can see obvious advantages to using ReLU as my activation function, e.g. the output of the input layer feeding the first hidden layer can be treated the same way as the input to any other hidden layer, in terms of the domain of its values.

But I have a question about how to use ReLU activations at the other end – i.e. as the output of the fully connected output layer.

I am looking to classify images of chess boards and so it seemed to me obvious that I could classify a part of the board as being a white square or a black square and then with some other (optional, as it were) classifications e.g. with pawn which itself could be black or white.

I cannot really see how to do this with ReLU – and so that suggests that what I really have to do is implement every possible outcome as a separate output neuron e.g., empty black square, empty white square, white square with white pawn, white square with black pawn, black square with white pawn … etc.

Is that correct?

(I originally asked this on Stack Overflow and got no answers, but a recommendation to ask here, so I deleted the original and posted this. Since asking I have built a network and, using random weights – learning is about to begin – I can see that the overall outputs, after several layers of leaky-ReLU-activated filters and a leaky ReLU fully connected layer, don't stray too far from the 0–1 range, though this may be luck more than anything else. I am still intrigued by this problem, so I would welcome answers.)

Best Answer

There are two possible approaches to this, neither of which involves ReLU at the last layer: ReLU is usually used only for intermediate layers, where its advantages apply, not for the output layer. You have a multi-class classification problem, which you could solve in either of two ways:

One class for every combination, as you mentioned: give the output layer one neuron per combination and apply a softmax over all of them. You can then use categorical cross entropy as the loss function to train on this. A drawback of this approach is the 'high' number of classes (not actually that high, though) and the fact that the model has to learn the same thing for multiple classes (the color of the field, for example).
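A minimal NumPy sketch of this first approach. The class count here is a hypothetical illustration (2 field colors × (6 piece types × 2 piece colors + one empty state) = 26 combined classes), and the `softmax` and `categorical_cross_entropy` helpers are written out by hand rather than taken from any framework:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(probs, target_index):
    # Negative log-likelihood of the true class.
    return -np.log(probs[target_index] + 1e-12)

# Hypothetical combined-class count:
# 2 field colors * (6 piece types * 2 piece colors + empty) = 26
n_classes = 2 * (6 * 2 + 1)

rng = np.random.default_rng(0)
logits = rng.normal(size=n_classes)   # raw outputs of the last layer
probs = softmax(logits)               # probabilities over all 26 classes
loss = categorical_cross_entropy(probs, target_index=3)
print(probs.sum())   # sums to 1 by construction
print(loss)
```

Note that because the softmax normalizes across all combined classes, every output neuron competes with every other one, even when two classes share a property such as field color.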

The (I think preferable) alternative is to make three output layers: one that identifies the type of piece (6 options), one for the color of the piece (2 options) and one for the color of the field (also 2 options). For the first you again need categorical cross entropy; for the other two you can use binary cross entropy. You can then combine these into a single loss function by simply summing them, and train your model on that. Use a softmax for the first output and sigmoids for the other two.
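The summed loss for this multi-head approach can be sketched in plain NumPy as below; the per-example targets are hypothetical, and in practice a framework such as Keras or PyTorch would compute and combine these losses for you:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def categorical_ce(probs, true_index):
    return -np.log(probs[true_index] + 1e-12)

def binary_ce(p, y):
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

rng = np.random.default_rng(1)
piece_logits = rng.normal(size=6)  # piece-type head (softmax, 6 options)
color_logit = rng.normal()         # piece-color head (sigmoid, 2 options)
field_logit = rng.normal()         # field-color head (sigmoid, 2 options)

# Hypothetical targets for a single training example:
# piece type index 2, white piece (1.0), black field (0.0).
piece_target, color_target, field_target = 2, 1.0, 0.0

# The combined loss is just the sum of the three per-head losses.
loss = (categorical_ce(softmax(piece_logits), piece_target)
        + binary_ce(sigmoid(color_logit), color_target)
        + binary_ce(sigmoid(field_logit), field_target))
print(loss)
```

Summing the losses trains all three heads jointly on the same shared features, so the network only has to learn each property (piece type, piece color, field color) once.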
