FDR Adjusted p value after multiple regression with categorical predictor with many levels

multiple regressionmultiple-comparisons

I would like to run a multiple linear regression to test if my protein of interest differs among different stages of a disease, while controlling for covariates/ confounders such as age and gender. The disease stage variable has four levels. I would like to compare the "control" stage with the three remaining stages. From what I understand, once I specify the disease stage variable as categorical and specify the reference group ("control"), the multiple regression output will compare the mean of the "control" group with each disease stage. I believe this is similar to running an ANCOVA. However, in ANCOVA analyses, a post-hoc test with an adjustment for multiple comparisons is performed (such as Bonferroni). I am aware that some statistical softwares (such as SPSS) perform post-hoc comparisons on unadjusted group means.

I have noticed that my predictor "disease stage" is significant in my multiple regression analysis. Therefore, I am wondering if it is possible to run a regression each time I would like to separately compare the mean of the control group to another, while controlling for the covariates. Finally, I would apply the Benjamini-Hochberg (FDR) procedure to the p values for the group differences. I understand some softwares (such as JMP) provide the regression output with all the comparisons with the reference group. However, I would think that it would be more appropriate to exclude certain pairwise comparisons from the regression. Such that we only focus on one comparison in the "post-hoc" regression model? I hope my last point makes sense.

Thank you

Best Answer

I have noticed that my predictor "disease stage" is significant in my multiple regression analysis. Therefore, I am wondering if it is possible to run a regression each time I would like to separately compare the mean of the control group to another, while controlling for the covariates.

You are much better off doing a single regression with your multi-category disease stage predictor and the covariates. With the "control" group as the reference for that predictor, you will get point estimates and standard errors/confidence intervals for the difference of each of the 3 other stages versus "control" in a single analysis that uses all of your data together. Such results are typically reported directly in a table of regression coefficients.

One could also report a multiple-comparison correction. Note that the Bonferroni correction is unnecessarily conservative; see the discussion in the help page for the p.adjust() function in the basic R stats package, which provides a simple way to apply any of several methods for p-value correction. If you have a specific set of pre-specified comparisons, there is no need to perform all pairwise tests and the correction need only be applied to the pre-specified comparisons.

Breaking this apart into separate models for each disease stage category versus "control" throws away potentially informative data and loses power. If you think that the associations of some covariates with outcome differs depending on disease stage then it's generally better to include interaction terms for those covariates with disease stage rather than do separate models. That takes advantage of all your data at once, giving you maximum power to detect true effects.

Related Question