Solved – Pooling data for logistic regression


I want to run a logistic regression on greyhound races. For each race I have a dummy variable (y) that takes value one when the dog wins and zero otherwise.

Unfortunately the number of hounds in each race can vary, as some are withdrawn for whatever reason. Currently I pool the data by vertically concatenating to create one huge column of race results and one large column for each independent variable.

  1. Is this the correct way to pool the data for this type of problem?
  2. Are there any issues with the fact that the data originally came from separate races often with different numbers of dogs running?

Best Answer

This is all wrong-headed. First, note that there is no meaningful ontological status of 'winner'.

How to determine the quality of something when all you have is a set of results from head-to-head comparisons (e.g., sports teams based on the results of games in a season) is a very tricky question. In the simplest case, a Bradley-Terry model could be used to predict the probability that unit $i$ will beat unit $j$. Bayesian network analyses can also be used.
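For reference, the usual form of the Bradley-Terry model is written out here. The symbol $\pi_i$ (a positive "strength" parameter for unit $i$) is notation introduced for this sketch, not something from the question:

$$P(i \text{ beats } j) = \frac{\pi_i}{\pi_i + \pi_j}, \qquad \pi_i > 0.$$

The strengths are only identified up to a common scale, so in practice one of them is fixed or they are constrained to sum to one.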

A Bradley-Terry model wouldn't quite work in your case, but your case is actually a lot simpler: you presumably already have data directly on the quality of each dog as a racing dog. Specifically, you should have each dog's race times. A better race dog is just a faster dog. If you want to determine what variables are related to the ability of a race dog, you need to model race times. If you want to rank existing dogs, you could fit a Bayesian model, or a mixed effects model and look at the BLUPs. If you wanted to estimate the probability that dog A will win a given race (e.g., for book-making purposes), you could take the fitted race time distribution for each dog in the race and simulate many runs, recording the proportion of runs in which dog A has the lowest time.
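The simulation idea in the last sentence could be sketched roughly as follows. This is only an illustration: the function name `win_probabilities`, the normal distribution for race times, and the means and standard deviations in `field` are all made-up assumptions, standing in for whatever fitted distributions your model actually produces.

```python
import random

def win_probabilities(dogs, n_sims=20000, seed=0):
    """Estimate each dog's chance of posting the lowest time in a race.

    `dogs` maps a (hypothetical) dog name to (mean_time, sd_time) in seconds.
    Normally distributed race times are an illustrative assumption.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in dogs}
    for _ in range(n_sims):
        # Draw one simulated race time per dog; the fastest dog wins the run.
        times = {name: rng.gauss(mu, sd) for name, (mu, sd) in dogs.items()}
        winner = min(times, key=times.get)
        wins[winner] += 1
    return {name: count / n_sims for name, count in wins.items()}

# Made-up fitted distributions for a three-dog race.
field = {"A": (29.8, 0.3), "B": (30.0, 0.3), "C": (30.4, 0.5)}
probs = win_probabilities(field)
print(probs)
```

Because every simulated run has exactly one winner, the estimated probabilities sum to one whatever the number of dogs in the race.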

As I understand your situation now from your comment, I gather you want to determine if odds that were given in the past (by whatever method) were reasonable given what you now know about whether a dog actually won its race. This is a different situation than I thought you were asking about in the body of the question. Here you aren't trying to build a model of any type, you are only trying to assess the calibration of the starting odds.

First, note that the odds that a bookmaker (e.g., the track) will offer / list are not the odds that they think are fair. They have to add a cut in order to make a living (cf., Odds made simple). So you need to remove that to get to the actual odds that were believed to be fair.
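One simple way to strip out the bookmaker's cut is sketched below. It assumes the odds are quoted in decimal form (total payout per unit staked) and that the margin is spread proportionally across the dogs; both are assumptions made for illustration, and other ways of removing the margin exist. The function name and the example odds are hypothetical.

```python
def fair_probabilities(decimal_odds):
    """Convert quoted decimal odds for one race into fair win probabilities.

    Assumes the bookmaker's margin is spread proportionally over all runners.
    Returns the normalised probabilities and the overround (sum of the raw
    implied probabilities; anything above 1 is the bookmaker's cut).
    """
    implied = [1.0 / odds for odds in decimal_odds]
    overround = sum(implied)
    fair = [p / overround for p in implied]
    return fair, overround

# Made-up quoted odds for a four-dog race.
quoted = [2.5, 3.0, 5.0, 8.0]
fair, overround = fair_probabilities(quoted)
print(overround)  # greater than 1 because of the margin
print(fair)       # normalised so that the probabilities sum to 1
```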

Once you have those numbers, the simplest check is that they should imply a 100% chance of one of the listed dogs winning. For example, if there were only two dogs and one had estimated odds of winning of 1 to 3 (a probability of 1/4), the other dog's odds should be 3 to 1 (a probability of 3/4); if they were 10 to 1 instead, something doesn't add up.
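The two-dog example can be checked with a few lines of arithmetic. The helper name below is hypothetical, and it reads "a to b" as odds in favour of winning, so the probability is a / (a + b):

```python
from fractions import Fraction

def prob_from_odds_in_favor(a, b):
    """Win probability implied by odds of 'a to b' in favour of winning."""
    return Fraction(a, a + b)

# 1 to 3 and 3 to 1: the probabilities are 1/4 and 3/4.
consistent = prob_from_odds_in_favor(1, 3) + prob_from_odds_in_favor(3, 1)

# 1 to 3 and 10 to 1: the probabilities are 1/4 and 10/11.
inconsistent = prob_from_odds_in_favor(1, 3) + prob_from_odds_in_favor(10, 1)

print(consistent)    # 1
print(inconsistent)  # 51/44
```

A total above one means the odds still contain some margin or are simply incoherent; a total below one means some probability is unaccounted for.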

To answer your specific question, if the odds add up, you needn't take into account the number of dogs in a race, because the odds being offered are supposed to account for that, and if they don't, that's something you want to discover.

At this point, you could assess the discriminative performance of the odds by computing Somers' D, which is informationally equivalent to the area under the receiver operating characteristic curve (AUC).
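A minimal sketch of that computation on the pooled data follows. For a binary outcome, Somers' D and the AUC are linked by D = 2 * AUC - 1, which is why they carry the same information. The function name and the tiny data set are made up for illustration, and the pairwise loop is only practical for modest sample sizes.

```python
def auc_and_somers_d(probs, outcomes):
    """AUC and Somers' D for predicted win probabilities vs. 0/1 outcomes.

    The AUC is the share of (winner, loser) pairs in which the winner was
    given the higher probability, counting ties as half.
    """
    winners = [p for p, y in zip(probs, outcomes) if y == 1]
    losers = [p for p, y in zip(probs, outcomes) if y == 0]
    concordant = 0.0
    for w in winners:
        for l in losers:
            if w > l:
                concordant += 1.0
            elif w == l:
                concordant += 0.5
    auc = concordant / (len(winners) * len(losers))
    return auc, 2.0 * auc - 1.0

# Made-up pooled rows: one fair probability and one win indicator per dog.
pooled_probs = [0.5, 0.3, 0.2, 0.6, 0.4]
pooled_wins = [1, 0, 0, 0, 1]
auc, somers_d = auc_and_somers_d(pooled_probs, pooled_wins)
print(auc, somers_d)  # 4 of the 6 winner/loser pairs are concordant
```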

Lastly, you could convert the fair odds into the log odds of winning and use them as a single predictive variable in a logistic regression model. The intercept and slope of that model should be $0$ and $1$, if the odds are not biased.
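As a rough sketch of this calibration check, the code below fits the two-parameter logistic regression by Newton-Raphson using only the standard library. Everything here is illustrative: `fit_calibration` is a hypothetical helper, and the data are synthetic races generated so that the fair probabilities are correct by construction. In practice you would use your pooled fair probabilities and win indicators, most likely with a standard statistics package.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def fit_calibration(fair_probs, won, n_iter=25):
    """Fit P(win) = sigmoid(a + b * logit(fair_prob)) by Newton-Raphson."""
    x = [logit(p) for p in fair_probs]
    a, b = 0.0, 1.0  # start at the "perfectly calibrated" values
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, won):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b

# Synthetic, well-calibrated races with 4 to 6 dogs each (illustration only).
rng = random.Random(1)
fair_probs, won = [], []
for _ in range(5000):
    n_dogs = rng.randint(4, 6)
    weights = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n_dogs)]
    total = sum(weights)
    probs = [w / total for w in weights]
    winner = rng.choices(range(n_dogs), weights=probs, k=1)[0]
    fair_probs.extend(probs)
    won.extend(1 if i == winner else 0 for i in range(n_dogs))

intercept, slope = fit_calibration(fair_probs, won)
print(intercept, slope)  # should land near 0 and 1 for calibrated odds
```

Note that the rows are pooled across races of different sizes without any adjustment, since the fair probabilities already reflect how many dogs ran.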
