Solved – Comparison of waiting times to geometric distribution

geometric-distribution

I am analysing data from observing about one million people over 24 months. For each person, each month is classified as a "success" or a "failure". I am specifically interested in the distribution of waiting times (= the lengths of runs of failures) between successes, and in comparing this distribution to the one that would arise if successes came from a Bernoulli process – the Geometric distribution.

My approach has been first to identify three subsets of people in my original group: those with 3 total successes in 24 months, those with 8, and those with 12. The rationale is that only groups with a single success probability throughout can be compared to a geometric distribution, which is itself parameterised by a single success probability. I selected the three specific values of 3, 8 and 12 total successes out of 24 fairly arbitrarily, to reflect the range of interest.

Let me use the group of people with 3 out of 24 successes as an example. Just using total counts, we can estimate the success probability for this group as
$$ \hat{p} = \frac{3}{24} = 0.125 $$

I then graph the actual waiting-time histogram for the 3-out-of-24 group against the Geometric distribution with parameter $p=0.125$. I find, for example, that the observed frequency of a waiting time of 0 months is substantially higher than the frequency of 0 months under the Geometric distribution. I interpret this as meaning that, for the 3-out-of-24 group, two successes in a row occur more often than they would if successes came from a Bernoulli process.
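To make the setup concrete, here is a minimal Python sketch of extracting waiting times from one person's 0/1 record, alongside the Geometric pmf used for comparison (the month sequence below is hypothetical, purely for illustration):

```python
def waiting_times(seq):
    """Lengths of the failure runs preceding each success in a 0/1 sequence."""
    times, run = [], 0
    for month in seq:
        if month == 1:          # success: record the run of failures before it
            times.append(run)
            run = 0
        else:                   # failure: extend the current run
            run += 1
    return times                # failures after the last success are dropped

# Hypothetical 24-month record with 3 successes (months 3, 8 and 9).
months = [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(waiting_times(months))    # [2, 4, 0] -- the 0 is two successes in a row

# Geometric pmf with p = 3/24: P(W = k) = (1 - p)^k * p, for k = 0, 1, 2, ...
p = 3 / 24
geom_pmf = [(1 - p) ** k * p for k in range(6)]
```

Pooling `waiting_times` over everyone in a group gives the observed histogram, which can then be plotted against `geom_pmf`.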

However, I can also compare against a different Geometric distribution whose parameter $p$ I estimate by the method of moments, i.e. by equating the mean waiting time observed in the 3-out-of-24 sample, $\mu$, to the expected waiting time of the Geometric distribution:
$$ \frac{1-\hat{p}}{\hat{p}} = \mu$$

This gives me an estimate of $\hat{p} \approx 0.169$, which is very different from $\hat{p}=0.125$. Visually, this distribution fits the 3-out-of-24 waiting-time data much better, but the observed data still deviate clearly from being Geometrically distributed; the deviations just show up in different places now. I could run a statistical goodness-of-fit test, but because of my very large sample size I have no doubt it would tell me that the data differ from the Geometric distribution at any significance level I like.
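Solving the moment equation for $\hat{p}$ gives $\hat{p} = 1/(1+\mu)$. A quick sketch of the two estimators (the sample mean below is a hypothetical value, chosen only so the result lands near the $\hat{p} \approx 0.169$ quoted above):

```python
# Estimator 1: raw success rate from total counts.
p_counts = 3 / 24                 # 0.125

# Estimator 2: method of moments, solving (1 - p) / p = mu for p.
mu = 4.9                          # hypothetical observed mean waiting time
p_mom = 1 / (1 + mu)              # ~0.1695

print(p_counts, p_mom)
```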

I have favoured the first method of finding $\hat{p}$ because

  1. I don't want to fit a Geometric distribution to my data. I know the data are not quite Geometrically distributed, and I am specifically interested in the systematic differences (as opposed to measurement error) between them and the appropriate Geometric distribution.
  2. Since the data aren't Geometrically distributed, I suspect the method of moments will be "thrown off": it will give me the parameter not of the Geometric distribution I should use as a baseline for comparison, but of another Geometric distribution that happens to fit the data better.

Is what I'm doing legitimate and how should I determine $\hat{p}$?

(I can provide an R example if I haven't been precise enough in expressing the problem as a theory problem.)

Best Answer

I believe I have answered my own question by figuring out (thanks to a colleague) that what I was doing was not legitimate after all. The distribution of waiting times in a set that is fixed in advance to consist of 24 Bernoulli trials, exactly 3 of which are successful, is not the Geometric distribution. Instead it is a similar but different distribution supported only on the finite set $\{0, \ldots, 21\}$. I haven't been able to solve the combinatorics to write down this distribution, but I was able to determine it for 24 trials by computational brute force: I generated every combination of 3 successes out of 24 and counted the waiting-time frequencies. I think the Geometric distribution, which is supported on all of $\mathbb{N}$, is obtained in the limit as the number of trials goes to infinity (with the success fraction held fixed).
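The brute force described above can be sketched in Python (assuming, as in the question, that a waiting time is the length of the failure run immediately before each success, including the run before the first one):

```python
from itertools import combinations
from collections import Counter

n, k = 24, 3            # 24 trials, exactly 3 of them successes
freq = Counter()

# Enumerate every placement of k successes among n trials and tally the
# length of the failure run before each success (0 = two successes in a
# row, or a success in the very first month).
for positions in combinations(range(n), k):
    prev = -1
    for pos in positions:
        freq[pos - prev - 1] += 1
        prev = pos

total = sum(freq.values())              # k * C(n, k) waiting times in all
pmf = {w: freq[w] / total for w in sorted(freq)}

print(max(pmf))   # 21 = n - k: the support is finite, unlike the Geometric
print(pmf[0])     # 0.125: at waiting time 0 this happens to equal 3/24
```

Comparing this finite-support pmf (rather than the Geometric pmf) to the observed histogram is the legitimate baseline for a group conditioned on its total number of successes.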