Hypothesis Testing – Analyzing Repeated Measures Experiment with Multiple Treatment Groups and Multiple Measures

Tags: anova, hypothesis-testing, kruskal-wallis-test, post-hoc, repeated-measures

I am doing a study to evaluate the effect of two dietary supplements on body composition.

Literature suggests that Treatment 1 (Dietary Supplement 1) affects body composition, with some side effects on a few blood profiles. We would like to know whether Treatment 2 (Dietary Supplement 2) has the same (or even a better) effect on body composition as Treatment 1, while not having those adverse effects on blood profiles.

Here is the dataset summary:

  • Total Number of Subjects: 171

  • Total Number of Groups: 3

    • Control, 56 Subjects

    • Treatment 1, 58 Subjects

    • Treatment 2, 57 Subjects

  • There are 12 repeated measures (each done 3 times), so 36 columns in total. The measures include Body Weight, Body Mass Index, Body Fat Mass, and a bunch of blood profiles like HDL, LDL, etc. Hence, some measures are related (e.g. Body Weight and BMI) while others aren't. It seems (see below) that some measures follow a Gaussian distribution and some do not.

  • All these measurements were done 3 times during the study.

Assume A is body weight. Body weight is measured at three time points during the study, so I have three body weight measures A1, A2, and A3 for all groups (Control, Treatment 1, Treatment 2). Thus, the sample data-set looks like this (each row represents a subject):

    Subject   Group         A1   A2   A3   B1   B2   B3   …
    1         Control       …    …    …    …    …    …    …
    2         Treatment 1   …    …    …    …    …    …    …
    3         Treatment 2   …    …    …    …    …    …    …

Here are my questions:

  1. For within-group tests, I've found that I should use repeated measures ANOVA with Bonferroni correction as the post-hoc test for Gaussian measures, and Friedman's test with Wilcoxon signed-rank tests (for each pair) as the post-hoc for non-Gaussian measures. Are these correct choices (particularly the post-hoc ones)?

  2. For between-group tests, should I use one-way ANOVA with Bonferroni correction as the post-hoc test for Gaussian measures, and Kruskal-Wallis with Dunn's test as the post-hoc for non-Gaussian measures?

  3. For this data-set, what does the output of post-hoc analysis of between-group effects mean exactly? For example, if B3 is significantly different between groups, does the post-hoc show which group's B3 is significantly different from which other group's?

  4. If I want to report all within- and between-group effects for each measurement along with post-hoc results, that would be a lot! Any advice on reporting analyses of such data-sets?

Normality tests

I used Shapiro-Wilk test in R to test the normality of each measure for each group. Here are histograms for B1 in two groups (as an example of non-normal data):

[Two histograms of B1, one per group, omitted.]
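For reference, such a per-measure, per-group check can be scripted; below is a minimal sketch, assuming a hypothetical long-format data frame `dat` with columns `group`, `measure`, `time`, and `value` (these names are illustrative, not from the original post):

```r
## Sketch: Shapiro-Wilk p-value for each measure within each group and time.
## `dat` is an assumed long-format data frame (group / measure / time / value).
results <- aggregate(value ~ group + measure + time, data = dat,
                     FUN = function(x) shapiro.test(x)$p.value)
names(results)[names(results) == "value"] <- "shapiro_p"
head(results)
```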

UPDATE:

  • We are not 100% sure that Treatment 1 is effective (there is some controversy in the literature about how effective it is), so we included the Control group to see whether Treatment 1 is even effective, then compare it to Treatment 2.

  • About the time points, there are three of them (at the beginning, after 6 weeks, and after 12 weeks). This is to see how fast or slow the effects of both treatments are, compared to each other and also compared to placebo.

  • While body composition measurements sound dependent, that is not always the case. One could lose weight without her Body Fat Mass changing at all, or the weight loss could be due solely to a change in Body Fat Mass.

Best Answer

I think one could write a whole book dealing exclusively with your question (and I am definitely not qualified to write it). So without any attempt at providing a comprehensive answer, here are some points that can hopefully be helpful.

Confirmatory vs. exploratory approach to analysis

As you note yourself, you have a very rich dataset and you can test a lot of things. We can quickly compute the number of meaningful tests: you have $12$ measures; each was measured $3$ times in $3$ groups. So if we count all pairwise tests, there are $3$ pairwise time comparisons within each of the $3$ groups, plus $3$ pairwise group comparisons at each of the $3$ time points, i.e. $9+9=18$ tests per measure, i.e. $216$ tests in total. You are obviously aware of the lurking multiple comparisons problem (remember the green jelly beans comic?), but if you are normally happy to use $\alpha=0.05$ and were to use e.g. Bonferroni adjustment, then you would have to use $\alpha = 0.05/216\approx 0.0002$ and risk not finding any significant effects because you do not have enough power.

This is of course not a unique situation; in fact, it is a very common one.

Broadly speaking, you can adopt one of two approaches.

Confirmatory approach insists on strict adherence to the rules of significance testing. You should formulate one or several (but as few as possible) research hypotheses in advance and carefully plan which statistical tests you are going to carry out. To mitigate the multiple comparisons / low power problem, you should try to design your tests such that you use as few tests as possible while having maximal power to detect what you really want to detect. For example, you might want to combine your measures into some composite or pooled measures that are likely to be most affected by Treatment 1 or 2. Or you can pool over measurement times. Etc. In any case, you try to boil down all your data to a couple of crucial comparisons, and then you do only those, applying a Bonferroni (or similar) adjustment. It is important that all of this is planned before you have ever looked at the data (because after looking at the data you will be tempted to change your tests).

Alas, in practice, this is often hardly possible.

Exploratory approach, in contrast, is like biting the bullet: you have a lot of rich data, so why not explore all sorts of relationships that are present in there. You will do lots of comparisons and lots of tests, and you will adjust your analysis strategy depending on what you see in the data, but whatever -- this is all exploratory. You cannot do that if you are running a clinical trial, but in more basic research this is often the only way to go. All $p$-values that you get out of this approach should be taken with a (big) grain of salt, though. In fact, some would say that you should not run or report any significance tests at all, but usually tests are still done. There is a good argument not to use multiple comparison adjustments (such as Bonferroni) at all, and rather to treat all the $p$-values as indicating strength of evidence in the Fisherian sense (as opposed to leading to a yes/no decision in the Neyman-Pearson sense).

Statistical tests if you are willing to assume normality

Let's for the moment ignore the issue of normality (see below) and assume that everything is normal. You have the following battery of tests:

  1. For each measure, within-group pairwise comparison between two measurement times is a paired t-test. It will test if the measurements differ between these two times.
  2. For each measure, between-group pairwise comparison for one measurement time is an unpaired t-test. It will test if these two groups differ on this specific measurement.
  3. For each measure, within-group comparison between all three different measurement times is a repeated measures ANOVA. It will test if measurement time has any effect at all.
  4. For each measure, between-group comparison for one fixed measurement time is a one-way ANOVA. It will test if the groups differ from each other in any way.
  5. For each measure, comparison between all groups and all times is a two-way repeated measures ANOVA. It will test if there is a significant effect of group, significant effect of time, and significant interaction between them.
  6. For all measures, comparison between all groups and all times is a two-way repeated measures MANOVA. It will test if there is a significant effect of group, significant effect of time, or significant interaction between them on all measures taken together.
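In R, tests #1-#5 might be run roughly as follows (a sketch, not code from the original post; it assumes a hypothetical long-format data frame `dat` with factor columns `subject`, `group`, `time` and a numeric `value` holding one measure, e.g. obtained from the wide table with `tidyr::pivot_longer`; a repeated measures MANOVA for #6 is more involved and omitted here):

```r
## Sketch; `dat` is an assumed long-format data frame with factor columns
## subject, group (Control/T1/T2), time (t1/t2/t3) and numeric value
## holding one measure (say, body weight).

## 1. Paired t-test: time 1 vs time 2 within the Control group
ctrl <- subset(dat, group == "Control")
ctrl <- ctrl[order(ctrl$subject), ]   # align subjects across time points
t.test(ctrl$value[ctrl$time == "t1"],
       ctrl$value[ctrl$time == "t2"], paired = TRUE)

## 2. Unpaired t-test: Treatment 1 vs Treatment 2 at the final time point
t.test(value ~ group,
       data = droplevels(subset(dat, time == "t3" & group != "Control")))

## 3. Repeated measures ANOVA: effect of time within one group
summary(aov(value ~ time + Error(subject/time), data = ctrl))

## 4. One-way ANOVA: effect of group at one fixed time point
summary(aov(value ~ group, data = subset(dat, time == "t3")))

## 5. Two-way repeated measures ANOVA: group, time, and their interaction
summary(aov(value ~ group * time + Error(subject/time), data = dat))
```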

Note that #1 and #2 can be seen as post-hocs to #3 and #4 respectively, #3 and #4 can be seen as post-hocs to #5, and #5 can be seen as a post-hoc to #6.

[With an additional complication that when these tests are done as post-hocs, they use some of the pooled estimates of the "parent" test in order to be more consistent with it; I am not sure, though, whether these procedures exist on the higher levels of the hierarchy.]

So you have a layered structure, and you can proceed in a top-down manner from the most general level (#6) down to the most specific tests (#1 and #2), running each next level only if you have a significant omnibus effect on the higher level (apologies for the potential confusion: "higher" levels have higher numbers in my list and hence are located at the bottom of it... "top-down" means starting with MANOVA in #6 and going down to the t-tests in #1 and #2). This should protect you from false positives on the lower levels, and so you arguably (!) don't need to do multiple comparison adjustments on the lower levels (but as far as I understand, opinions on that differ).

You can also start directly at some middle layer and e.g. run #5 twelve times without doing #6, or run #3 thirty-six times and #4 thirty-six times without doing #5. In the confirmatory framework, you must then apply some multiple comparison correction (such as Bonferroni, or rather Holm-Bonferroni). In the exploratory framework this is not necessary, see above. (Example: maybe without adjustment you get a $p=0.01$ effect in many different measures and it is very consistent; you are probably looking at a real effect then, but if you make a Bonferroni adjustment then everything will stop being significant -- too bad. In the exploratory framework you should rather keep $p=0.01$ as is and use your own expert judgment, but of course at your own risk.)
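Both adjustments are one-liners in R; the $p$-values below are made-up placeholders, purely for illustration:

```r
## Hypothetical p-values from running test #5 on each of the 12 measures
p <- c(0.001, 0.004, 0.010, 0.012, 0.030, 0.040,
       0.080, 0.150, 0.300, 0.450, 0.700, 0.900)
p.adjust(p, method = "holm")        # Holm-Bonferroni (uniformly more powerful)
p.adjust(p, method = "bonferroni")  # plain Bonferroni
```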

By the way, if your Treatments work at all, you should expect a significant interaction effect in #6 and #5, so these are (hopefully!) almost guaranteed, and the interesting stuff begins at layers #3 and #4. If there is a real danger that both Treatments are as bad as placebo, then perhaps you should really start with #6.

Another remark: a more "modern" approach would be to use a linear mixed model (with subject as a random effect) instead of repeated measures ANOVA, but that is a whole other topic that I am not very familiar with. It would be great if somebody posted an answer here written from a mixed models perspective.
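For completeness, a minimal mixed-model sketch under the same hypothetical `dat` (using the CRAN packages `lme4` and `lmerTest`; this is only an illustration, not the answer's own analysis):

```r
library(lme4)
library(lmerTest)  # adds approximate F-tests / p-values for lmer fits

## Random intercept per subject replaces the Error(subject/time) stratum
m <- lmer(value ~ group * time + (1 | subject), data = dat)
anova(m)   # tests for group, time, and the group:time interaction
```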

Statistical tests if you are not willing to assume normality

There are rank-based analogues of most of these tests, but not of all of them. The analogues are as follows:

  1. Wilcoxon signed-rank test
  2. Mann-Whitney-Wilcoxon test
  3. Friedman test
  4. Kruskal-Wallis test
  5. ?? (probably does not exist)
  6. ???? (most probably does not exist, but see here)

An additional complication is that post-hocs become tricky. The proper post-hoc to Kruskal-Wallis is not the Mann-Whitney-Wilcoxon test but Dunn's test [it takes into account the issue that I mentioned in the square brackets above]. Similarly, the proper post-hoc to Friedman is not the Wilcoxon test; I am not sure whether one exists, but if it does, it is even more obscure than Dunn's.
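Under the same hypothetical `dat`, the rank-based battery might look like this (a sketch; `dunn.test` is one CRAN implementation of Dunn's test):

```r
## Sketch; `dat` and `ctrl` as in the parametric block above.
d3 <- droplevels(subset(dat, time == "t3"))

## 1. Wilcoxon signed-rank: time 1 vs time 2 within the Control group
wilcox.test(ctrl$value[ctrl$time == "t1"],
            ctrl$value[ctrl$time == "t2"], paired = TRUE)

## 2. Mann-Whitney-Wilcoxon: Treatment 1 vs Treatment 2 at the final time
wilcox.test(value ~ group, data = droplevels(subset(d3, group != "Control")))

## 3. Friedman test: effect of time within one group
friedman.test(value ~ time | subject, data = ctrl)

## 4. Kruskal-Wallis plus Dunn's test as post-hoc
kruskal.test(value ~ group, data = d3)
dunn.test::dunn.test(d3$value, d3$group, method = "holm")
```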

Normality testing

It is in general a very bad idea to test for normality in order to decide whether to use parametric or nonparametric tests. It will affect your $p$-values in unpredictable ways. At least in the confirmatory paradigm, you should decide on the test prior to looking at the data; if you have doubts about the normality approximation, then rather don't use it. See here for more discussion: Choosing a statistical test based on the outcome of another (e.g. normality).

In your case, this means that you should use only parametric tests or only nonparametric tests for all measures (unless you have a priori grounds to suspect substantial deviations from normality in only a specific subset of measures; this does not seem to be the case).

In simple cases people often suggest using ranked tests because they are powerful and simple, and you don't need to worry about the assumptions. But in your case the nonparametric route will be a mess, so you have a good argument in favour of classical ANOVAs. By the way, the histograms that you posted look "normal enough" to me; with your sample size you should not worry too much about them not being normal.

Data presentation

I would strongly advise relying on visualization as opposed to only listing hundreds of $p$-values in text or a table. With data like this, the first thing I would do (note: this is very exploratory!) would be to make a giant figure with 12 subplots, where each subplot corresponds to one measure and shows time on the x-axis (three measurements) and groups as lines of different colors (with error bars).
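Such a figure can be sketched with ggplot2, again assuming a hypothetical long-format `dat`, here with an additional `measure` column distinguishing the 12 measures:

```r
library(ggplot2)

## Mean +/- SE per group and time point, one panel per measure
ggplot(dat, aes(x = time, y = value, colour = group, group = group)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1) +
  facet_wrap(~ measure, ncol = 4, scales = "free_y")
```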

Then just stare at this figure for a really long time and try to see if it makes sense. Hopefully the effects will be consistent across measures, across time points, etc. I would make this figure the main figure of the paper.

If you like, you can then pepper this figure with the results of your statistical tests (marking significant differences with stars).


Brief answers to your specific questions

  1. Yes (almost -- see the caveat about Wilcoxon as post-hoc)
  2. Yes
  3. Yes
  4. Use figures as much as you can.

Word of caution

We would like to know if Treatment 2 (Dietary Supplement 2) has the same effect (or even better) on body composition than Treatment 1, while not having those adverse effects on blood profiles.

To show that Treatment 2 does not have as many adverse effects as Treatment 1, it is not enough to show that there is a significant difference between T1 and Control but no significant difference between T2 and Control: the difference between "significant" and "not significant" is not itself necessarily significant. This is a common mistake. You actually need to show a significant difference between T2 and T1.

