Solved – How to derive 2×2 cell counts from contingency table margins and the odds ratio

contingency tablesepidemiologyodds-ratiosimultaneous equation

I'm certain there's a unique solution to this, and I think I've worked it out before but now it has me pulling my hair out:

Given the margins of a 2×2 contingency table such as the prevalences of a binary exposure and a binary outcome, as well as their odds ratio, how can you calculate the four cell proportions (i.e. the reverse of the usual problem of calculating an odds ratio)? e.g.

                outcome     no outcome
exposed         a           b               Pr(E)
not exposed     c           d               Pr(Ê)=1-Pr(E)
                Pr(O)       Pr(Ô)=1-Pr(O)   100%

given the equations:

a + b = Pr(E)
b + c = Pr(Ê) = 1 – Pr(E)
a + c = Pr(O)
b + d = Pr(Ô) = 1 – Pr(O)
(ad) / (bc) = OR

and the constraints that all cell values and odds ratio are positive, and that a + b + c + d sum to 100% (i.e. everything is expressed as as proportion, although you can expand the equations to include an arbitrary known population count if you wish) express each cell value in terms of Pr(E), Pr(O) and OR.

Best Answer

Write $\rho$ for the odds ratio, $\beta=\Pr(E)$, $\gamma=\Pr(O)$. Four independent equations are

$$\cases{a+b=\beta \\ a+c=\gamma \\ a+b+c+d=1 \\ ad = \rho bc.}$$

Adding the first two shows

$$b+c = \beta + \gamma - 2a. \tag{1}$$

Multiplying the first two equations and using $(1)$ yields

$$bc = \beta\gamma - (\beta+\gamma)a + a^2.\tag{2}$$

Multiplying the third equation by $a$, using the fourth to re-express $ad$ and plugging in $(1)$ and $(2)$ gives

$$a^2 + a(\beta+\gamma-2a) + \rho(\beta\gamma - (\beta+\gamma)a + a^2) = a.$$

In a more standard form, this a zero of the quadratic

$$(\rho-1)a^2 + [(\beta+\gamma)(1-\rho)-1]a + \rho\beta\gamma.\tag{3}$$

Provided $\beta, \gamma,\rho$ are consistent with some $2\times 2$ table, there will be at least one real zero of $(3)$, easily found using any quadratic formula. For either zero, solutions for the remaining entries are readily found from the first three of the original equations as

$$\cases{b=\beta-a\\c=\gamma-a\\d = 1+a-\beta-\gamma.}$$


There will be at most one valid solution for $a$, determined by the non-negativity of the coefficients. Here is an R implementation of the solution as the function f along with a test using randomly generated tables. The test outputs the random table, its reconstructed value from f, and a measure of the difference between them. By wrapping the test in replicate, I have run it 10,000 times. The final output gives the largest difference found: up to floating point error it equals zero, demonstrating the correctness of this approach.

f <- function(beta, gamma, rho, eps=1e-15) {
  a <- rho-1
  b <- (beta+gamma)*(1-rho)-1
  c_ <- rho*beta*gamma

  if (abs(a) < eps) {
    z <- -c_ / b
  } else {
    d <- b^2 - 4*a*c_
    if (d < eps*eps) s <- 0 else s <- c(-1,1)
    z <- (-b + s*sqrt(max(0, d))) / (2*a)
  }
  y <- vapply(z, function(a) zapsmall(matrix(c(a, gamma-a, beta-a, 1+a-beta-gamma), 2, 2)),
              matrix(0.0, 2, 2))
  i <- apply(y, 3, function(u) all(u >= 0))
  return(y[,,i])
}
set.seed(17)
i<-0
sim <- replicate(1e4, {
  while(TRUE) {
    x <- matrix(round(rexp(4), 2), 2, 2)
    if(all(rowSums(x) > 0) && all(colSums(x) > 0) && x[1,2]*x[2,1] > 0) break
  }
  x <- x / sum(x)
  beta <- rowSums(x)[1]
  gamma <- colSums(x)[1]
  rho <- x[1,1]*x[2,2] / (x[1,2]*x[2,1])

  y<- f(beta, gamma, rho)
  delta <- try(zapsmall(c(1, sqrt(crossprod(as.vector(x-y)))))[2])
  if ("try-error" %in% class(delta)) cat("Error processing ", x, "\n")
  delta
})
max(sim)
Related Question