Solved – Calculating some kind of confidence or error rate for a set of binary data


Newbie question here. I'm calculating error rates from software test results. Basically, any particular test run is going to pass or fail; in this case the failures come from inherent race conditions in the software in question. We're running about 50k tests a week overall (some particular tests might be run 4000 times a week, some 200, with all sorts of numbers in between), so I'm starting to summarize the results as failure rates.

However, clearly the answer from failing 3 out of 200 tests is a less certain value than failing 30 out of 2000. So given a series of binary data, with no known baseline, is there a standard method that uses the sample size to put some kind of error bound on the rate?

The other challenge with this data is that we are doing rolling binning, calculating the failure rates every day, which means that early in the day there are only a few events. It would be good to be able to quantify the difference in confidence between a 4/4 success result and a 100/100 success result.

Part of my thinking was that you could quantify this as 1/n (n being the number of runs), since you can never know the answer better than the "resolution" you have. However, that's just instinct kicking in, with no statistical backing behind it.

Best Answer

Constant failure rate within a given proportion, independent failure events

If the failure rate is constant under a given set of conditions and the tests are independent, then the occurrence of failures follows a Bernoulli process.

In that case, you can construct confidence intervals for the proportion of failures using standard methods for a binomial proportion.

The standard error of the proportion is $\sqrt{ p (1- p)/n}$, which is usually estimated as $\sqrt{\hat p (1-\hat p)/n}$, where $\hat p$ is the sample proportion.

With a large sample size (as long as the proportion isn't too small), you might as well use the normal approximation; for an interval with coverage $1-\alpha$, the interval for the population proportion $p$ is

$$\hat p\pm z_{1-\frac{\alpha}{2}}\sqrt{\hat p (1-\hat p)/n}$$
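As a minimal sketch of that formula (plain Python, with SciPy assumed available for the normal quantile; the 30-out-of-2000 numbers come from the question):

```python
from math import sqrt
from scipy.stats import norm

def wald_interval(failures, n, alpha=0.05):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p_hat = failures / n
    z = norm.ppf(1 - alpha / 2)                 # ~1.96 for a 95% interval
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# The 30-failures-in-2000-tests example from the question
lo, hi = wald_interval(30, 2000)
print(f"p_hat = {30 / 2000:.4f}, 95% interval = ({lo:.4f}, {hi:.4f})")
```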

If $p$ is small (so small that $np$ is less than 20, say), you might be better off choosing one of the other intervals, such as the Wilson interval or the Clopper-Pearson interval.
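One way to get these intervals, assuming statsmodels is available, is its `proportion_confint` function; a quick sketch for the 3-out-of-200 case from the question:

```python
from statsmodels.stats.proportion import proportion_confint

# 3 failures out of 200 tests, as in the question
for method in ("wilson", "beta"):   # "beta" is statsmodels' name for Clopper-Pearson
    lo, hi = proportion_confint(count=3, nobs=200, alpha=0.05, method=method)
    print(f"{method:>6}: ({lo:.4f}, {hi:.4f})")
```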

In the situation where all the results are 0, clearly $np$ is likely to be much less than 20, or even 5, and normal approximations cannot be used. A few of the other intervals do okay in this case (though some still need to be truncated at 0 on the left side), but a common approach* is the rule of three. Similar comments apply to the all-1 case by interchanging success and failure.

*(assuming a 95% interval, though the approach adapts to other coverage probabilities)
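As a quick illustration using the all-success cases from the question: the rule of three says that with 0 failures in $n$ runs, an approximate 95% upper bound on the failure probability is $3/n$. For 0 failures in 100 runs that gives $3/100 = 0.03$; for 0 failures in 4 runs it gives $3/4 = 0.75$ (the approximation is rough for $n$ that small, but it makes the point), which directly quantifies how much more a 100/100 result tells you than a 4/4 result.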


If the underlying rate is not constant within a calculated proportion, or there is dependence, then the nature of the variation in rate or the dependence structure (respectively) would usually need to be characterized in some way.

If we can assume independence between trials, we can quantify whether there's over- or under-dispersion relative to the constant-$p$ assumption, and we can look for trends (a changing average over time) in the binomial proportions, via (say) smoothing splines in logistic regression. If there are such trends, you need to be careful about what your hypotheses might be (you're not generally testing just a difference then).
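An illustrative sketch only (statsmodels and patsy are assumed to be available, and the daily counts below are simulated placeholders): a binomial GLM for the failure proportion with a B-spline in time, plus a crude Pearson-based dispersion check.

```python
import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

rng = np.random.default_rng(0)

# Hypothetical daily aggregates for one test: day index, failures, passes
day = np.arange(30)
fails = rng.binomial(200, 0.02, size=30)   # placeholder counts
passes = 200 - fails

# Cubic B-spline basis in time (4 df); the formula adds an intercept
X = dmatrix("bs(day, df=4)", {"day": day}, return_type="dataframe")

# Binomial GLM on (event, non-event) counts; the first column is the modelled
# event (failures), so the fitted curve tracks the failure probability over time
fit = sm.GLM(np.column_stack([fails, passes]), X,
             family=sm.families.Binomial()).fit()
print(fit.summary())

# Pearson chi-square / residual df well above 1 suggests overdispersion
print("dispersion estimate:", fit.pearson_chi2 / fit.df_resid)
```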

If you're prepared to assume a mixture distribution with a constant combination of binomial proportions within each group being compared (which may be untenable), it might be reasonable to consider some kind of permutation (when testing) or bootstrapping (when constructing intervals) approach.
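A rough sketch of the permutation-test idea for comparing failure proportions between two groups (the counts and group labels here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes (1 = fail) for two groups being compared,
# e.g. the same test under two builds: 3/200 vs 14/200 failures
group_a = np.repeat([1, 0], [3, 197])
group_b = np.repeat([1, 0], [14, 186])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_a = group_a.size

# Permutation test: shuffle the group labels and recompute the difference
diffs = np.empty(10_000)
for i in range(diffs.size):
    perm = rng.permutation(pooled)
    diffs[i] = perm[:n_a].mean() - perm[n_a:].mean()

p_value = np.mean(np.abs(diffs) >= abs(observed))
print(f"observed difference = {observed:.4f}, permutation p-value ~ {p_value:.3f}")
```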

On the other hand, if we can assume constant $p$, we can check for serial dependence fairly readily.
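For instance, treating the run-ordered 0/1 outcomes of a single test as a sequence, a lag-1 autocorrelation and a Ljung-Box test give a rough check for serial dependence (statsmodels is assumed available; the data below are simulated):

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(1)

# Hypothetical run-ordered outcomes for one test (1 = fail), simulated i.i.d. here
outcomes = rng.binomial(1, 0.02, size=2000)

# Lag-1 autocorrelation of the 0/1 sequence; values far from 0 suggest dependence
lag1 = np.corrcoef(outcomes[:-1], outcomes[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.3f}")

# Ljung-Box test across a few lags; small p-values point to serial dependence
print(acorr_ljungbox(outcomes, lags=[1, 5, 10]))
```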
