I have some experimental data that I'm trying to analyze.
I have 1 response variable and 3 explanatory variables (these are factor variables).
The explanatory variables are the presence of a disease (positive and negative),
a genetic profile (X and Y), and whether or not an MRI contrast agent was given
(YES and NO).
The structure of the data looks like this:
measurement profile disease contrast
1 -1.76269 X NEG YES
2 -0.34492 X NEG NO
3 0.57455 X POS YES
4 2.16539 X POS NO
. . . .
. . . .
. . . .
77 -1.76269 Y NEG YES
78 -0.34492 Y NEG NO
79 0.57455 Y POS YES
80 2.16539 Y POS NO
I looked into using ANOVA for this analysis but the post hoc Tukey HSD
looks at all possible combinations of the explanatory variables so it
makes far more comparisons than I actually care about.
We have some specific hypotheses, e.g.:
X.NEG.NO will differ from Y.NEG.NO,
X.NEG.NO will differ from X.NEG.YES,
X.NEG.NO will differ from X.POS.NO, etc.
(notice that each compared group consists of the interaction of all three variables)
How can I get only some specific comparisons out of TukeyHSD?
Is there a better approach?
Reproducible example:
set.seed(42)  # makes the simulated data reproducible
my.data <- data.frame(measurement = rnorm(80),
                      my.profile = rep(c("X", "Y"), each = 40),
                      my.disease = rep(c("NEG", "NEG", "POS", "POS"), times = 20),
                      my.contrast = rep(c("NO", "YES"), times = 40))
Best Answer
Suppose that we have one response variable and one explanatory variable (a factor with 5 levels).
If we do an ANOVA followed by the classical TukeyHSD post hoc, we obtain all possible pairwise comparisons.
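To illustrate with the question's simulated data (the seed and the model call below are my sketch, not from the original answer):

```r
# Three-way ANOVA followed by the default TukeyHSD post hoc.
set.seed(42)
my.data <- data.frame(measurement = rnorm(80),
                      my.profile  = rep(c("X", "Y"), each = 40),
                      my.disease  = rep(c("NEG", "NEG", "POS", "POS"), times = 20),
                      my.contrast = rep(c("NO", "YES"), times = 40))
m <- aov(measurement ~ my.profile * my.disease * my.contrast, data = my.data)
tuk <- TukeyHSD(m)
# The three-way interaction term alone yields choose(8, 2) = 28 pairwise
# comparisons - far more than the handful of planned hypotheses:
nrow(tuk$`my.profile:my.disease:my.contrast`)
```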
If you want to make only the comparisons that are interesting to you, use CONTRASTS.
R has several built-in contrast matrices (treatment, Helmert, sum, polynomial...), but you can also create your own.
First, decide which comparisons you are interested in.
Then create a contrast matrix; for reference, consider a factor with five levels A, B1, B2, C1 and C2, and the following four planned comparisons:
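In R such a matrix can be built column by column; this sketch assumes a five-level factor with levels A, B1, B2, C1 and C2:

```r
# One planned comparison per column; positive and negative weights mark
# the two groups being compared, zeros exclude a level from the comparison.
cmat <- cbind(c( 4, -1, -1, -1, -1),   # "A" vs. the mean of all other levels
              c( 0,  1,  1, -1, -1),   # B1+B2 vs. C1+C2
              c( 0,  0,  0,  1, -1),   # C1 vs. C2
              c( 0,  1, -1,  0,  0))   # B1 vs. B2
rownames(cmat) <- c("A", "B1", "B2", "C1", "C2")
colSums(cmat)  # every column sums to zero
```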
Take notice that the sum of each column is equal to 0.
If a factor has 5 levels, there can be only 4 such comparisons, because the factor carries only 4 degrees of freedom.
In the first column you compare the mean of "A" with the mean of all other categories.
In the second column you compare category "B" with category "C" (B1+B2 vs. C1+C2).
In the third column you compare only within category "C" (C1 vs. C2).
In the fourth column you compare only within category "B" (B1 vs. B2).
To see the results, re-fit the ANOVA with the created contrast matrix.
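A minimal sketch of that step (the data frame `con.data`, its columns and the seed are my assumptions, since the original code is not shown):

```r
# Attach the custom contrasts to the factor, re-fit the ANOVA, and use
# split= to break the factor's 4 df into one F-test per planned comparison.
set.seed(42)
con.data <- data.frame(y = rnorm(50),
                       category = gl(5, 10, labels = c("A", "B1", "B2", "C1", "C2")))
contrasts(con.data$category) <- cbind(c(4, -1, -1, -1, -1),  # A vs. rest
                                      c(0,  1,  1, -1, -1),  # B vs. C
                                      c(0,  0,  0,  1, -1),  # C1 vs. C2
                                      c(0,  1, -1,  0,  0))  # B1 vs. B2
m <- aov(y ~ category, data = con.data)
summary(m, split = list(category = list("A vs rest" = 1, "B vs C" = 2,
                                        "C1 vs C2" = 3, "B1 vs B2" = 4)))
```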
Each row of the resulting table corresponds to one comparison (one column of the contrast matrix); e.g. if con.data$category1 is significant, there is a significant difference between the mean of "A" and the mean of all other groups, etc.
In short:
Try to make a contrast matrix containing only comparisons you are interested in. With the example above it should not be difficult.
However !!!
I would not run post hoc tests (or contrasts) on the data immediately. That is like teaching a model to "run" before it can "walk". So the first thing I would do is build a model containing all the variables, and then remove all non-significant variables (or their interactions) according to the rule of marginality. This reduction will determine whether all your desired comparisons are even necessary.
So start by fitting the full model:
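With the question's data, the full model could be fitted like this (a sketch; the seed is mine):

```r
# Full factorial model: all three main effects plus all interactions.
set.seed(42)
my.data <- data.frame(measurement = rnorm(80),
                      my.profile  = rep(c("X", "Y"), each = 40),
                      my.disease  = rep(c("NEG", "NEG", "POS", "POS"), times = 20),
                      my.contrast = rep(c("NO", "YES"), times = 40))
m1 <- aov(measurement ~ my.profile * my.disease * my.contrast, data = my.data)
summary(m1)  # 3 main effects + 3 two-way + 1 three-way interaction = 7 terms
```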
For example, if the factor "disease" is not significant (alone or in an interaction), it should not be included in the post hoc.
Suppose the results of model m1 look like this:
Use the rule of marginality:
update no. 1: the triple interaction is non-significant, so let's remove it.
Now look at the ANOVA table of the updated model:
update no. 2: the double interactions are also non-significant, so let's remove them one at a time (start with the double interaction with the highest p-value).
update no. 3: remove another double interaction.
update no. 4: remove the last double interaction.
update no. 5: remove the non-significant factor.
Model m6 is our final model.
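The whole reduction can be sketched as follows (self-contained; with random data the actual p-values, and hence the order of removal, will differ from the answer's hypothetical output):

```r
# Stepwise simplification by the rule of marginality: drop the highest-order
# non-significant interaction first, then lower-order terms, then factors.
set.seed(42)
my.data <- data.frame(measurement = rnorm(80),
                      my.profile  = rep(c("X", "Y"), each = 40),
                      my.disease  = rep(c("NEG", "NEG", "POS", "POS"), times = 20),
                      my.contrast = rep(c("NO", "YES"), times = 40))
m1 <- aov(measurement ~ my.profile * my.disease * my.contrast, data = my.data)
m2 <- update(m1, . ~ . - my.profile:my.disease:my.contrast)  # update no. 1
m3 <- update(m2, . ~ . - my.profile:my.disease)              # update no. 2
m4 <- update(m3, . ~ . - my.profile:my.contrast)             # update no. 3
m5 <- update(m4, . ~ . - my.disease:my.contrast)             # update no. 4
m6 <- update(m5, . ~ . - my.disease)                         # update no. 5
summary(m6)  # final model: only my.profile and my.contrast remain
```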
Unfortunately, it is now obvious that making comparisons such as Y.NEG.NO vs. X.NEG.NO makes no sense, because the triple and double interactions are all non-significant. And it would not be correct to simply pick the desired rows out of a TukeyHSD table (even if such a post hoc showed a significant difference !!!). Believe me, such an approach would be very hard to defend in the peer-review process. So you can only make the comparisons within profile (X vs. Y) and within contrast (NO vs. YES); the disease factor is non-significant. Do not be sad - even a non-significant result is a result.
The rule of marginality is a practical application of Occam's razor (see Crawley, M.J., Statistics: An Introduction Using R, 2nd ed., Wiley, 2014, Ch. 10 "Multiple Regression", p. 195). A good model is always the simplest one that still explains the largest portion of the variability in the data.
You can publish the result in an article either as a "full model" (containing all the factors and their interactions) or as a "minimal adequate model" (MAM, containing only the significant effects). I would prefer to include both versions in the manuscript and let the reviewers decide which one they prefer.
The point is not to go fishing for p-values in post hoc tests when the ANOVA terms themselves are non-significant.