Solved – Should we apply multiple-comparisons adjustments when using confidence intervals?

confidence-interval · inference · multiple-comparisons

Suppose we have a multiple comparisons scenario, such as post hoc inference on pairwise statistics or a multiple regression, in which we make a total of $m$ comparisons. Suppose also that we would like to support inference about these multiple quantities using confidence intervals.

1. Do we apply multiple comparison adjustments to CIs? That is, just as multiple comparisons compel a redefinition of $\alpha$ to either the family-wise error rate (FWER) or the false discovery rate (FDR), does the meaning of confidence (or credibility1, or uncertainty, or prediction, or inferential… pick your interval) get similarly altered by multiple comparisons? I realize that a negative answer here will moot my remaining questions.

2. Are there straightforward translations of multiple comparison adjustment procedures from hypothesis testing to interval estimation? For example, would adjustments focus on changing the $\text{CI-level}$ term in the confidence interval: $\text{CI}_{\theta} = \hat{\theta} \pm t_{1-(1-\text{CI-level})/2}\,\hat{\sigma}_{\theta}$?
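For the Bonferroni case, the standard translation is exactly this kind of change: each interval is computed at level $1-\alpha/m$ rather than $1-\alpha$, so that all $m$ intervals jointly cover with probability at least $1-\alpha$. A minimal sketch (the estimates, standard errors, and degrees of freedom are made up for illustration; SciPy assumed):

```python
import numpy as np
from scipy import stats

alpha, m, df = 0.05, 6, 30  # hypothetical: 6 comparisons, 30 degrees of freedom
theta_hat = np.array([0.8, 1.1, -0.4, 0.2, 0.9, -1.3])  # made-up estimates
se = np.full(m, 0.35)                                    # made-up standard errors

t_unadj = stats.t.ppf(1 - alpha / 2, df)        # per-interval critical value
t_bonf = stats.t.ppf(1 - alpha / (2 * m), df)   # Bonferroni family-wise critical value

ci_unadj = np.column_stack((theta_hat - t_unadj * se, theta_hat + t_unadj * se))
ci_bonf = np.column_stack((theta_hat - t_bonf * se, theta_hat + t_bonf * se))
# The Bonferroni intervals are strictly wider than the unadjusted ones.
```

The only change relative to an unadjusted interval is the quantile at which $t$ is evaluated; the point estimate and standard error are untouched.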

3. How would we address step-up or step-down control procedures for CIs? Some family-wise error rate adjustments in the hypothesis testing approach to inference are 'static', in that precisely the same adjustment is made to each separate inference. For example, the Bonferroni adjustment alters the rejection criterion from:

  • reject if $p\le \alpha$ to:
  • reject if $p\le \frac{\alpha}{m}$,

but the Holm–Bonferroni step-down adjustment is not 'static'; rather, it is made by:

  • first ordering the $p$-values from smallest to largest, and then
  • rejecting if $p\le \frac{\alpha}{m+1-i}$ (where $i$ indexes the ordering of the $p$-values), until
  • we fail to reject a null hypothesis, at which point we automatically fail to reject all remaining null hypotheses.
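The three steps above can be sketched directly (the thresholds $\alpha/(m+1-i)$ are the standard Holm–Bonferroni ones; the p-values are made up):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Boolean rejection mask under the Holm-Bonferroni step-down procedure."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)                    # step 1: smallest p-value first
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(order, start=1):
        if p[idx] <= alpha / (m + 1 - i):    # step 2: reject at alpha/(m+1-i)
            reject[idx] = True
        else:
            break                            # step 3: stop; all remaining fail
    return reject

holm([0.001, 0.04, 0.03, 0.005])  # rejects the two smallest p-values only
```

Note that the threshold depends on the rank $i$ of each $p$-value, which is exactly what makes the procedure non-static.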

Because rejection/failure to reject is not happening with CIs (more formally, see the references below), does that mean that stepwise procedures (including all of the FDR methods) don't translate? I should caveat here that I am not asking how to translate CIs into hypothesis tests (the representatives of the 'visual hypothesis testing' literature cited below get at that non-trivial question).

4. What about any of those other intervals I mentioned parenthetically in 1?

1 Gosh, I sure hope I don't get in trouble with those rockin' the sweet, sweet Bayesian styles by using this word here. 🙂

References
Afshartous, D. and Preston, R. (2010). Confidence intervals for dependent data: Equating non-overlap with statistical significance. Computational Statistics & Data Analysis, 54(10):2296–2305.

Cumming, G. (2009). Inference by eye: reading the overlap of independent confidence intervals. Statistics In Medicine, 28(2):205–220.

Payton, M. E., Greenstone, M. H., and Schenker, N. (2003). Overlapping confidence intervals or standard error intervals: What do they mean in terms of statistical significance? Journal of Insect Science, 3(34):1–6.

Tryon, W. W. and Lewis, C. (2008). An inferential confidence interval method of establishing statistical equivalence that corrects Tryon’s (2001) reduction factor. Psychological Methods, 13(3):272–277.

Best Answer

An excellent topic which is, sadly, not given enough attention.

When discussing multiple parameters and confidence intervals, a distinction should be made between simultaneous inference and selective inference. Reference [2] gives an excellent demonstration of the matter.

  • Simultaneous confidence intervals mean that all the parameters are covered at once with $1-\alpha$ confidence.
  • Selective confidence intervals mean that only a subset of selected parameters is covered, with the guarantee holding for the selected set.

These two concepts can be combined: Say you construct intervals only on parameters for which you rejected the null hypothesis. You are clearly dealing with selective inference. You may want to guarantee simultaneous coverage of selected parameters, or marginal coverage of selected parameters. The former would be the counterpart of FWER control, and the latter of FDR control.

Now more to the point: not all testing procedures have accompanying intervals. For FWER procedures and their accompanying intervals, see [3]; sadly, this reference is a bit dated. For the interval counterpart of BH FDR control, see [1], and an application in [4] (which also includes a brief review of the matter). Please note that this is a fresh and active research field, so you can expect more results in the near future.
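To make the idea in [1] concrete: after selecting $R$ of $m$ parameters with the BH procedure, each selected parameter gets a marginal interval at level $1 - R\alpha/m$, which controls the false coverage rate over the selected set. A rough sketch with made-up $z$-scale estimates (SciPy assumed):

```python
import numpy as np
from scipy import stats

alpha, m = 0.05, 10
theta_hat = np.array([3.2, 2.9, 0.4, -0.2, 1.1, 2.5, 0.0, -0.6, 0.3, 3.8])
se = np.ones(m)                                  # made-up: unit standard errors
p = 2 * stats.norm.sf(np.abs(theta_hat / se))    # two-sided p-values

# Step 1: BH selection -- R = max{i : p_(i) <= i*alpha/m}
order = np.argsort(p)
below = p[order] <= alpha * np.arange(1, m + 1) / m
R = (np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
selected = order[:R]

# Step 2: marginal intervals for the selected at level 1 - R*alpha/m
if R > 0:
    z_fcr = stats.norm.ppf(1 - R * alpha / (2 * m))  # wider than the 1.96 default
    lo = theta_hat[selected] - z_fcr * se[selected]
    hi = theta_hat[selected] + z_fcr * se[selected]
```

When all $m$ parameters are selected ($R=m$) the adjustment vanishes and the intervals reduce to ordinary marginal ones; when few are selected the intervals widen to pay for the selection.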

[1] Benjamini, Y., and D. Yekutieli. “False Discovery Rate-Adjusted Multiple Confidence Intervals for Selected Parameters.” Journal of the American Statistical Association 100, no. 469 (2005): 71–81.

[2] Cox, D. R. “A Remark on Multiple Comparison Methods.” Technometrics 7, no. 2 (1965): 223–24.

[3] Hochberg, Y., and A. C. Tamhane. Multiple Comparison Procedures. New York, NY, USA: John Wiley & Sons, Inc., 1987.

[4] Rosenblatt, J. D., and Y. Benjamini. “Selective Correlations; Not Voodoo.” NeuroImage 103 (December 2014): 401–10.
