Solved – How to combine the results of several binary tests

Tags: bayesian, classification, diagnostic

First off let me say that I had one stats course in engineering school 38 years ago. So I'm flying blind here.

I've got the results of what are essentially 18 separate diagnostic tests for a disease. Each test is binary — yes/no, with no threshold that can be adjusted to "tune" the test. For each test I have what is ostensibly valid data on true/false positives/negatives when compared to the "gold standard", yielding specificity and sensitivity numbers (and anything else you can derive from that data).

Of course, no single test has sufficient specificity/sensitivity to be used alone, and when you "eyeball" the results of all tests there's frequently no obvious trend.

I'm wondering what is the best way to combine these numbers in a way that will yield a final score that is (hopefully) more reliable than any single test. So far I've come up with the technique of combining the specificities of the TRUE (positive) tests using

spec_combined = 1 - (1 - spec_1) * (1 - spec_2) * ... * (1 - spec_N)

and combining sensitivities of the FALSE tests the same way. The ratio

(1 - sens_combined) / (1 - spec_combined) 

then seems to yield a reasonably good "final score", with a value over 10 or so being a reliable TRUE and a value under 0.1 or so being a reliable FALSE.
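
For concreteness, here is a minimal Python sketch of that heuristic as I read it; the function names are mine, and the sensitivities/specificities are taken as fractions rather than percentages:

def combined_complement(values):
    # 1 - (1 - v_1) * (1 - v_2) * ... * (1 - v_N)
    prod = 1.0
    for v in values:
        prod *= (1.0 - v)
    return 1.0 - prod

def heuristic_score(pos_test_specs, neg_test_senss):
    # spec_combined over the positive (TRUE) tests, sens_combined over the negative (FALSE) tests
    spec_combined = combined_complement(pos_test_specs)
    sens_combined = combined_complement(neg_test_senss)
    # ratio read as: over ~10 a reliable TRUE, under ~0.1 a reliable FALSE
    return (1.0 - sens_combined) / (1.0 - spec_combined)

# e.g. with the four-test example further down (tests 1 and 2 positive, 3 and 4 negative):
# heuristic_score([0.50, 0.70], [0.30, 0.85])  ->  0.105 / 0.15 = 0.7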

But this scheme lacks any true rigor, and for some combinations of test results it seems to produce an answer that is counter-intuitive.

Is there a better way to combine the test results of multiple tests, given their specificities and sensitivities? (Some tests have a specificity of 85% and a sensitivity of 15%; other tests are just the opposite.)

OK, my head hurts!

Let's say I've got tests 1-4 with sensitivities/specificities (in %):

  1. 65/50
  2. 25/70
  3. 30/60
  4. 85/35

Tests 1 and 2 are positive, 3 and 4 negative.

The putative probability that 1 is a false positive would be (1 – 0.5), and for 2 (1 – 0.7), so the probability that both are false positives would be 0.5 x 0.3 = 0.15.

The putative probability that 3 and 4 are false negatives would be (1 – 0.3) and (1 – 0.85) or 0.7 x 0.15 = 0.105.

(We'll ignore for the moment the fact that the numbers don't add up.)

But the presumed probabilities that 1 and 2 are true positives are 0.65 and 0.25, giving 0.65 x 0.25 = 0.1625, while the presumed probabilities that 3 and 4 are true negatives are 0.6 and 0.35, giving 0.6 x 0.35 = 0.21.
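
Just to make that arithmetic easy to re-check, a few lines reproducing the products above (fractions rather than percentages; variable names are mine):

sens = {1: 0.65, 2: 0.25, 3: 0.30, 4: 0.85}
spec = {1: 0.50, 2: 0.70, 3: 0.60, 4: 0.35}

# tests 1 and 2 positive, 3 and 4 negative
p_both_false_pos = (1 - spec[1]) * (1 - spec[2])   # 0.5  * 0.3  = 0.15
p_both_false_neg = (1 - sens[3]) * (1 - sens[4])   # 0.7  * 0.15 = 0.105
p_both_true_pos  = sens[1] * sens[2]               # 0.65 * 0.25 = 0.1625
p_both_true_neg  = spec[3] * spec[4]               # 0.6  * 0.35 = 0.21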

Now we can ask two questions:

  1. Why don't the numbers add up (or even come close)? (The sens/spec numbers I used are from "real life".)
  2. How should I decide which hypothesis is (most likely) true (in this example it seems to be "negative" for both calcs, but I'm not sure that's always the case), and what can I use for a "figure of merit" to decide if the result is "significant"?

More info

This is an attempt to refine and extend an existing "weighting" scheme that is entirely "artistic" in nature (i.e., just pulled out of someone's a**). The current scheme is basically along the lines of "If any two of the first three are positive, and if two of the next four, and either of the next two, then assume positive." (That's a somewhat simplified example, of course.) The available statistics don't support that weighting scheme — even with a crude weighting algorithm based on the measured stats I come up with significantly different answers. But, absent a rigorous way of evaluating the stats, I have no credibility.

Also, the current scheme only decides positive/negative, and I need to create a (statistically valid) "ambiguous" case in the middle, so some figure of merit is needed.

Latest

I've implemented a more-or-less "pure" Bayesian inference algorithm, and, after going round and round on several side issues, it seems to be working pretty well. Rather than working from specificities and sensitivities I derive the formula inputs directly from the true positive/false positive numbers. Unfortunately, this means that I can't use some of the better quality data that isn't presented in a way that allows these numbers to be extracted, but the algorithm is much cleaner, allows modification of the inputs with much less hand calculation, seems pretty stable, and produces results that match "intuition" fairly well.
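
For anyone attempting something similar, here is a minimal sketch of the kind of Bayesian update (in odds/likelihood-ratio form) that can be built from true/false positive/negative counts; the count layout, the conditional-independence assumption, and every name here are my own, not necessarily what was actually implemented:

def posterior_from_counts(prior, results, counts):
    # counts[test]  = (tp, fn, fp, tn) against the gold standard
    # results[test] = True for a positive result, False for a negative one
    odds = prior / (1.0 - prior)                  # prior odds of disease
    for test, is_positive in results.items():
        tp, fn, fp, tn = counts[test]
        sens = tp / (tp + fn)                     # P(test positive | diseased)
        spec = tn / (tn + fp)                     # P(test negative | healthy)
        if is_positive:
            lr = sens / (1.0 - spec)              # positive likelihood ratio
        else:
            lr = (1.0 - sens) / spec              # negative likelihood ratio
        odds *= lr                                # treats tests as conditionally independent
    return odds / (1.0 + odds)                    # posterior probability of disease

# hypothetical usage, counts made up for illustration:
# counts  = {"A": (65, 35, 50, 50), "B": (30, 70, 40, 60)}
# results = {"A": True, "B": False}
# posterior_from_counts(0.5, results, counts)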

I've also come up with an "algorithm" (in the purely programming sense) to handle the interactions between interdependent observations. Basically, rather than looking for a sweeping formula, I keep for each observation a marginal probability multiplier that is modified as earlier observations are processed, based on a simple table — e.g., "If observation A is true then modify observation B's marginal probability by a factor of 1.2." Not elegant, by any means, but serviceable, and it seems to be reasonably stable across a range of inputs.
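
A rough sketch of what that bookkeeping might look like; the A/B pair and the 1.2 factor come from the example in the text, while the names and the exact quantity being scaled are my guesses:

# adjustment[(earlier_obs, later_obs)] = factor applied to the later observation's
# marginal probability multiplier once the earlier observation has come up true
adjustment = {("A", "B"): 1.2}

def adjusted_multipliers(ordered_results, base_multiplier, adjustment):
    # ordered_results: list of (observation_name, is_true) pairs, processed in order
    seen_true = []
    adjusted = {}
    for obs, is_true in ordered_results:
        m = base_multiplier[obs]
        for earlier in seen_true:
            m *= adjustment.get((earlier, obs), 1.0)   # default: no interaction
        adjusted[obs] = m
        if is_true:
            seen_true.append(obs)
    return adjusted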

(I'll award the bounty to what I deem to have been the most helpful post in a few hours, so if anyone wants to get a few licks in, have at it.)

Best Answer

"I'm wondering what is the best way to combine these numbers in a way that will yield a final score that is (hopefully) more reliable than any single test." A very common way is to compute Cronbach's alpha and, more generally, to perform what some would call a "standard" reliability analysis. This would show to what degree a given score correlates with the mean of the 17 other scores; which tests' scores might be best dropped from the scale; and what the internal consistency reliability is both with all 18 and with a given subset. Now, some of your comments seem to indicate that many of these 18 are uncorrelated; if that is true, you may end up with a scale that consists of just a few tests.

EDIT AFTER COMMENT: Another approach draws on the idea that there is a tradeoff between internal consistency and validity. The less correlated your tests are, the better their content coverage, which enhances content validity (if not reliability). So thinking along these lines you would ignore Cronbach's alpha and the related indicators of item-total correlation and instead use a priori reasoning to combine the 18 tests into a scale. Hopefully such a scale would correlate highly with your gold standard.