Solved – How to formulate data for neural network with “class” inputs and a numerical output

neural networks, python, train

I'm just starting to play with neural networks (via PyBrain). I've got some questions about problem formulation. I've taken a bunch of rugby data (very topical), which consists of a list of match results i.e. two team names, and the points difference (Team A score – Team B score).

I'm really not sure how to represent this information as a dataset suitable for training my neural network.

So far I've tried:

  1. Two inputs, representing unique number IDs for each team, and a points diff output
  2. An array with one entry per team, with teams not playing set to 0, Team A set to 1 and Team B set to -1, and a points diff output
    1. Adding each match twice, but expressed from the point of view of the other team i.e. Team A vs Team B, points diff 10; and Team B vs Team A points diff -10.
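For illustration, the second encoding and the mirrored-match idea could be sketched roughly as follows. This is plain Python rather than PyBrain code; the helper names `encode_match` and `build_dataset` are hypothetical, and the two match results are made up purely for the example.

```python
def encode_match(team_a, team_b, teams):
    """Return a vector with +1 for team_a, -1 for team_b and 0 elsewhere."""
    index = {name: i for i, name in enumerate(teams)}
    vec = [0.0] * len(teams)
    vec[index[team_a]] = 1.0
    vec[index[team_b]] = -1.0
    return vec


def build_dataset(matches):
    """matches: list of (team_a, team_b, points_diff) tuples.

    Every match is added twice: once as given, and once from the
    other team's point of view with the sign of the target flipped.
    """
    teams = sorted({t for a, b, _ in matches for t in (a, b)})
    inputs, targets = [], []
    for a, b, diff in matches:
        inputs.append(encode_match(a, b, teams))
        targets.append(float(diff))
        inputs.append(encode_match(b, a, teams))
        targets.append(float(-diff))
    return teams, inputs, targets


# Made-up results, for illustration only.
matches = [("New Zealand", "Algeria", 60), ("Algeria", "Wales", -25)]
teams, X, y = build_dataset(matches)
```

Each pair `(X[i], y[i])` could then be passed to whatever dataset object the library expects. One property worth noting: in this encoding the mirrored match is exactly the negated input vector, so a model without bias terms and with an odd activation such as tanh would give Team B vs Team A the negative of its Team A vs Team B prediction by construction.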

Have I completely misunderstood how to represent the data?

I've seen training datasets which take a numeric input and a classification output e.g. ClassificationDataSet in PyBrain, but ideally I think I might want the other way round i.e. two class inputs (i.e. the two team names) and a numeric output.

I should be clear, I'm not expecting tonnes of skill here, but I would expect it to learn things like:

  1. The predicted point difference for Team A vs Team B is the negative of that for Team B vs Team A.
  2. New Zealand vs Algeria should really be a win for NZ

Best Answer

I have tried a similar approach for soccer matches, with moderate success.

The input to your network, IMO, should be a vector containing as much information as possible that would help predict the outcome of the match: home team (unique ID), away team (unique ID), the points or ranking of each team at the time of the match, points (or mean points) over the last N matches (3, 5, 7, etc.), goals scored (or their mean) over the last N matches, and any other information that is likely to be relevant. Bear in mind that the more the algorithm knows about each team's recent trends, the more accurate its predictions are likely to be.
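As a rough sketch of how some of the inputs listed above might be assembled, the snippet below computes each team's mean points difference over its last N matches, using only results from earlier matches. The function names, the tuple layout, and the choice of 0.0 for a team with no history are all assumptions made for the example, not part of any library.

```python
from collections import defaultdict, deque


def mean_or_zero(values):
    """Mean of the values, or 0.0 when there is no history yet (an assumption)."""
    return sum(values) / len(values) if values else 0.0


def build_features(matches, n_last=3):
    """matches: chronological list of (home, away, home_pts, away_pts).

    Returns a list of (features, target) pairs, where features is
    [home recent form, away recent form] and target is the points
    difference from the home team's point of view.
    """
    # Per team, keep only the points differences of its last n_last matches.
    history = defaultdict(lambda: deque(maxlen=n_last))
    rows = []
    for home, away, home_pts, away_pts in matches:
        # Features are read BEFORE the history is updated, so a match
        # never sees its own result.
        features = [mean_or_zero(history[home]), mean_or_zero(history[away])]
        target = home_pts - away_pts
        rows.append((features, target))
        history[home].append(target)
        history[away].append(-target)
    return rows


# Made-up results, for illustration only.
rows = build_features([("A", "B", 20, 10), ("B", "C", 15, 15), ("A", "C", 30, 10)])
```

Further columns (rankings, goals scored, team identifiers) could be appended to `features` in the same way, as long as each is computed only from information available before the match.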

Whether the teams should be represented by name or by unique ID is a detail of the software you are using, and depends on whether it translates between the two internally. Another approach is to model the goals scored by each team via numeric regression.

Using the first approach I can get around 60% to 65% accuracy on the 1,X,2 prediction (home win, tie, away win).
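If the network predicts a points difference rather than a class, one possible way to score it on the 1,X,2 scale is sketched below. The `draw_margin` threshold (how close to zero a predicted difference must be to count as a tie) is an assumption you would need to tune, and the numbers at the end are made up for illustration.

```python
def outcome(diff, draw_margin=0.0):
    """Map a points difference to '1' (home win), 'X' (tie) or '2' (away win)."""
    if diff > draw_margin:
        return "1"
    if diff < -draw_margin:
        return "2"
    return "X"


def accuracy(predicted_diffs, actual_diffs, draw_margin=0.5):
    """Fraction of matches whose predicted 1,X,2 class matches the real one."""
    hits = sum(
        # Predictions use the margin; real results are a tie only at exactly 0.
        outcome(p, draw_margin) == outcome(a)
        for p, a in zip(predicted_diffs, actual_diffs)
    )
    return hits / len(actual_diffs)


# Made-up numbers, for illustration only.
score = accuracy([2.3, -0.2, -4.0, 1.1], [1, 0, 2, 3])
```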

One thing to bear heavily in mind is that, because the domain is chaotic (like weather forecasting, the economy, etc.), a seemingly clear and easy win for a big team can quite possibly turn into a defeat, and vice versa. It is a turbulent domain where the classes cannot be cleanly separated, as I found when I tried to do so with Self-Organizing Maps. In some other domains it is quite easy to project the data into distinct regions, but not in sports forecasting. That's what makes betting interesting and risky.

Hope this helps.