Solved – Does the confidence interval for an odds ratio assume a log-normal distribution?

Tags: confidence-interval, normal-distribution, odds-ratio

The formula floating around for calculating a 95% confidence interval for an odds ratio is:

e^(log(OR) ± 1.96 × sqrt(1/a + 1/b + 1/c + 1/d))
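
As a sketch of that calculation in R (the cell counts a, b, c, d below are hypothetical; any 2×2 table works the same way):

```r
# Hypothetical 2x2 cell counts: a, b = exposed events / non-events,
# c, d = unexposed events / non-events
a <- 20; b <- 80; c <- 10; d <- 90

or <- (a / b) / (c / d)                    # sample odds ratio
se <- sqrt(1/a + 1/b + 1/c + 1/d)          # SE of log(OR)
ci <- exp(log(or) + c(-1, 1) * 1.96 * se)  # back-transform to the OR scale
round(c(OR = or, lower = ci[1], upper = ci[2]), 2)
```

Note that the interval is symmetric on the log scale but asymmetric around the OR after exponentiating.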

Does one need to confirm that the log-OR is normally distributed in the data in order to use this calculation? In what situations might the log-OR not be normally distributed, and how would one calculate a 95% CI then?

The 1.96 term clearly comes from the normal distribution. Would the standard error term be different if the log-OR followed some other distribution?

Best Answer

Given the comments, I have included a proof of the equation in question at the bottom of this answer.


Given a two-by-two contingency table where the OR is $\frac{a/b}{c/d}$, taking logs gives $\log(OR) = \log(a) - \log(b) - (\log(c) - \log(d))$. Hence the log odds ratio is additive. Because it is additive, the sampling distribution of the log odds ratio converges to normality much faster than that of the odds ratio, which has a multiplicative structure.

We can demonstrate the above using bootstrapping:

fake_dat <- data.frame(
  y = c(rep(1, 125), rep(0, 50), rep(1, 100), rep(0, 75)),
  g = c(rep("A", 175), rep("B", 175))
)
(mat <- table(fake_dat))
   g
y     A   B
  0  50  75
  1 125 100

Take A to be the treatment group, and B to be the control group.

# And the odds ratio is?
(mat[2, 1] / mat[1, 1]) / (mat[2, 2] / mat[1, 2])
[1] 1.875
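
Plugging these counts (a = 125, b = 50, c = 100, d = 75) into the formula from the question gives the large-sample Wald interval:

```r
a <- 125; b <- 50; c <- 100; d <- 75      # cells from the table above
or <- (a / b) / (c / d)                   # 1.875, as computed above
se <- sqrt(1/a + 1/b + 1/c + 1/d)         # SE of the log odds ratio
ci <- exp(log(or) + c(-1, 1) * 1.96 * se)
round(ci, 2)                              # approx (1.20, 2.92)
```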

We can resample the odds ratio and the log odds ratio from the data repeatedly and check their distributions:

set.seed(1)  # make the resampling reproducible
or.s <- lor.s <- rep(NA, 3000)
for (i in 1:3000) {
  rand.samp <- sample(1:nrow(fake_dat), nrow(fake_dat), replace = TRUE)
  new_dat <- fake_dat[rand.samp, ]
  mat <- table(new_dat)
  or <- mat[2, 1] / mat[1, 1] / (mat[2, 2] / mat[1, 2])
  lor <- log(or)
  or.s[i] <- or
  lor.s[i] <- lor
}
par(mfrow = c(1, 2))
hist(or.s, main = "OR")
hist(lor.s, main = "LOR")
par(mfrow = c(1, 1))
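
For comparison, a percentile interval from the same kind of bootstrap (re-run here with a seed so the snippet is self-contained) lands close to the Wald interval:

```r
set.seed(1)  # reproducible resampling
fake_dat <- data.frame(
  y = c(rep(1, 125), rep(0, 50), rep(1, 100), rep(0, 75)),
  g = c(rep("A", 175), rep("B", 175))
)
lor.s <- replicate(3000, {
  rand.samp <- sample(nrow(fake_dat), replace = TRUE)
  m <- table(fake_dat[rand.samp, ])
  log((m[2, 1] / m[1, 1]) / (m[2, 2] / m[1, 2]))
})
(ci <- exp(quantile(lor.s, c(0.025, 0.975))))  # percentile CI on the OR scale
```

Close agreement between the percentile interval and the Wald interval is another sign that the large-sample approximation is adequate at this sample size.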

[Figure: side-by-side histograms of the bootstrap replicates, titled "OR" and "LOR".]

We can see that at our sample size of 350, the estimate of the sampling distribution of the log odds ratio appears normally distributed. If you drop the sample size enough, you will arrive at a non-normally distributed estimate of the sampling distribution for the log odds ratio.

If we then apply the delta method to calculate the variance of the log odds ratio, we arrive at the equation you mentioned. The delta method relies on the central limit theorem, so the lone requirement for normality is a large enough sample size: the method is a large-sample approximation.

One way to check whether your sample size is large enough is to calculate a small-sample odds ratio. One common small-sample formula adds 0.5 to each cell count before calculating the odds ratio and standard error in the usual way. If the small-sample odds ratio and confidence interval differ markedly from the standard odds ratio and CI, then the large-sample approximation is probably inadequate. Note, however, that the small-sample method is best used as a diagnostic rather than as a solution.
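
A sketch of that diagnostic for the table above, using the common +0.5 (Haldane–Anscombe) adjustment:

```r
a <- 125; b <- 50; c <- 100; d <- 75
adj <- c(a, b, c, d) + 0.5                       # add 0.5 to every cell
or.adj <- (adj[1] / adj[2]) / (adj[3] / adj[4])  # adjusted odds ratio
se.adj <- sqrt(sum(1 / adj))                     # adjusted SE of log(OR)
ci.adj <- exp(log(or.adj) + c(-1, 1) * 1.96 * se.adj)
round(c(OR = or.adj, ci.adj), 2)
```

Here the adjusted OR and CI barely move from the unadjusted 1.875 and its interval, so the large-sample approximation looks adequate for this table.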

You can find coverage of these topics in chapter 2 of Agresti's Categorical Data Analysis and chapter 7 of Jewell's Estimation and Inference for Measures of Association. For the delta method, a good guide is Powell's Approximating variance of demographic parameters using the delta method.


Origin of equation in question:

If we suppose that the cell counts {$n_i, i = 1,2,3,4$} have a multinomial $(n, \pi_i)$ distribution, then the sample proportion $\hat\pi_i = n_i/n$ has mean and variance:

\begin{equation} \mathrm{E}(\hat\pi_i)=\pi_i \quad\mathrm{and}\quad \mathrm{var}(\hat\pi_i)=\pi_i(1-\pi_i)/n=(\pi_i-\pi_i^2)/n \end{equation}

and for $i \neq j$:

\begin{equation} \mathrm{cov}(\hat\pi_i,\hat\pi_j)=-\pi_i\pi_j/n \end{equation}

We know that:

\begin{equation} OR = \frac{a / b}{c/d}=\frac{n\pi_a\times n\pi_d}{n\pi_b\times n\pi_c}=\frac{\pi_a\times \pi_d}{\pi_b\times \pi_c} \end{equation}

Then, $\log(OR) = \log(\pi_a) + \log(\pi_d) - \log(\pi_b) - \log(\pi_c)$.

For the delta method, we have the equation,

\begin{equation} \mathrm{var}(G)=\mathrm{var}[f(X_1,X_2,\ldots,X_n)]\\ =\sum_{i=1}^n\mathrm{var}(X_i)\left[\frac{\partial f}{\partial X_i}\right]^2 + 2\sum_{i=1}^{n}\sum_{j>i}\mathrm{cov}(X_i,X_j)\left[\frac{\partial f}{\partial X_i}\frac{\partial f}{\partial X_j}\right] \end{equation}

Given this equation, the variance of $\log(OR) = \log(\pi_a) + \log(\pi_d) - \log(\pi_b) - \log(\pi_c)$ is:

\begin{equation} \frac{1}{\pi_a^2}\frac{\pi_a-\pi_a^2}{n} + \frac{1}{\pi_d^2}\frac{\pi_d-\pi_d^2}{n} + \frac{1}{\pi_b^2}\frac{\pi_b-\pi_b^2}{n} + \frac{1}{\pi_c^2}\frac{\pi_c-\pi_c^2}{n}\\ - \frac{2}{n}\frac{\pi_a\pi_d}{\pi_a\pi_d} + \frac{2}{n}\frac{\pi_a\pi_b}{\pi_a\pi_b} + \frac{2}{n}\frac{\pi_a\pi_c}{\pi_a\pi_c} + \frac{2}{n}\frac{\pi_d\pi_b}{\pi_d\pi_b} + \frac{2}{n}\frac{\pi_d\pi_c}{\pi_d\pi_c} - \frac{2}{n}\frac{\pi_b\pi_c}{\pi_b\pi_c}\\ = \frac{1}{n\pi_a}-\frac{1}{n} +\frac{1}{n\pi_b}-\frac{1}{n} +\frac{1}{n\pi_c}-\frac{1}{n} +\frac{1}{n\pi_d}-\frac{1}{n} +\frac{4}{n}\\ = \frac{1}{n\pi_a} +\frac{1}{n\pi_b} +\frac{1}{n\pi_c} +\frac{1}{n\pi_d}-\frac{4}{n}+\frac{4}{n}\\ =\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d} \end{equation}

In the last step, the estimated expected counts $n\hat\pi_a$, $n\hat\pi_b$, $n\hat\pi_c$, $n\hat\pi_d$ are replaced by the observed cell counts $a$, $b$, $c$, $d$, giving the standard error used in the interval at the top.