Solved – Handling outliers when comparing two means in a repeated measures design

outliers, repeated measures, t-test

I am doing a simple study that involves taking a measure at time point 1 and again at time point 2 (12 weeks later). The sample was a class, but not all members were present at both time points, so I have 20 data points at time 1 and 21 data points at time 2. The measure has a score, and I am taking the means and doing a simple t-test to determine whether the intervention caused any increase in the measure at time point 2.

Questions: Do I need to throw out outliers if they are more than 2 standard deviations higher than the mean?

When I do the t-test, do I need to look at one- or two-tailed distributions? My hypothesis is that the intervention will increase the mean at time 2, so I think I should consider a one-tailed distribution.

Lastly, I am assuming that I have to do a paired t-test since it is a repeated measures design.

Best Answer

I will take these out of order. If it is possible to establish a correspondence between the measurements in the first set and the measurements in the second set (for example, Bob's score at time 1 and Bob's score at time 2 correspond because they both came from Bob), then you should do a paired t-test. That is, you should not calculate means for each time, but take differences, and calculate the mean and standard deviation of the differences. The standard error of the mean difference (i.e., the denominator of the t-statistic) is that standard deviation divided by $\sqrt{n}$, where $n$ is the number of pairs. If some students did not participate on one of the occasions, then their scores should be set aside. Furthermore, you do not care if a score is more than 2 s.d.'s higher than the mean, although you may care if one of your differences is more than 2 s.d.'s above the mean of the differences.
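
To make the mechanics concrete, here is a minimal sketch in Python. The scores are invented for illustration, and `scipy.stats.ttest_rel` is used only to confirm the hand computation:

```python
# A minimal sketch of the paired analysis described above.
# The scores below are made up for illustration.
import numpy as np
from scipy import stats

# Scores for the students who were present at BOTH time points;
# anyone who missed an occasion has been set aside.
time1 = np.array([52, 60, 48, 55, 63, 58, 50, 57, 61, 54,
                  49, 56, 62, 53, 59, 51, 64, 47, 58, 55])
time2 = np.array([55, 64, 50, 59, 66, 60, 53, 60, 65, 57,
                  52, 60, 64, 56, 63, 54, 67, 50, 61, 59])

# Work with the differences, not the two sets of raw scores.
diffs = time2 - time1
n = len(diffs)
mean_diff = diffs.mean()
sd_diff = diffs.std(ddof=1)        # sample SD of the differences
se_diff = sd_diff / np.sqrt(n)     # standard error of the mean difference

t_stat = mean_diff / se_diff
print(f"t = {t_stat:.3f} on {n - 1} degrees of freedom")

# The same test in one call:
print(stats.ttest_rel(time2, time1))
```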

The definition of an outlier is a data point that came from a different population than the one you want to study; it is not a data point that is far away from the rest of your data. However, we almost never know whether a data point came from a different distribution than the rest of our data, except that it looks really different. If you ever spend much time running simulations, you will notice that every so often a data point you know comes from the same distribution (because you wrote the simulation code) looks quite a bit different from the rest; a quick demonstration appears after the guidelines below. This is an uncomfortable fact, but it is true nonetheless. Ultimately, you need to decide whether you believe that data point belongs there or not. There are some (potentially) helpful guidelines:

  1. With ~20 data points, a z-score with an absolute value greater than 2 is pretty unlikely (although it wouldn't be if you had, say, 100 data points);
  2. You can look at a plot of your data (e.g., a histogram) to see if the larger value is contiguous with the rest of your data, or if there is a large break between it and the rest;
  3. It can help to run your analysis both with the potential outlier and without it (often, you will get the same answer both ways, and that's reassuring);
  4. A final possibility is to use 'trimmed samples': exclude the top and bottom 2 data points (given that you have ~20, this is a 10% trimmed sample). Note that this lessens your power, but many people think it is more even-handed. (Guidelines 1, 3, and 4 are sketched in code after this list.)
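
Here is a rough sketch of guidelines 1, 3, and 4, continuing with the invented `time1`, `time2`, and `diffs` from the snippet above (remember that it is the differences, not the raw scores, that matter here):

```python
import numpy as np
from scipy import stats

# Guideline 1: z-scores of the paired differences.
z = (diffs - diffs.mean()) / diffs.std(ddof=1)
suspect = np.argmax(np.abs(z))
print(f"largest |z| = {np.abs(z)[suspect]:.2f} (pair #{suspect})")

# Guideline 3: run the test with and without the suspect pair.
keep = np.arange(len(diffs)) != suspect
print("with:   ", stats.ttest_rel(time2, time1))
print("without:", stats.ttest_rel(time2[keep], time1[keep]))

# Guideline 4: a 10% trimmed mean of the differences
# (with ~20 points, this drops the top and bottom 2).
print("trimmed mean of diffs:", stats.trim_mean(diffs, 0.1))
```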

In the end, I'm afraid, you will still have to make a decision.
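
The earlier point about simulations is easy to check for yourself. The sketch below, which assumes normal data, draws many samples that all come from a single distribution and counts how often one of their points looks extreme relative to its own sample:

```python
# Every point here comes from the SAME normal distribution, yet
# samples routinely contain points far from their own mean.
import numpy as np

rng = np.random.default_rng(42)
n_samples, n = 10_000, 20
for cutoff in (2.0, 2.5, 3.0):
    hits = 0
    for _ in range(n_samples):
        x = rng.normal(size=n)
        z = (x - x.mean()) / x.std(ddof=1)
        hits += np.any(np.abs(z) > cutoff)
    print(f"|z| > {cutoff}: {hits / n_samples:.1%} of clean samples")
```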

Lastly, you should know that the question of two- vs. one-tailed t-tests has long been a contentious topic. It is probably not as important as people have made it out to be, but that is the nature of these things. Personally, I'm against one-tailed tests, but my opinion is really unimportant. A question you could ask yourself is:

What if I find that the mean decreased by a very large amount? Would I say, 'Nope, there was no change', or would I say there was a change?

If you could believe a negative change were the data to support it, then you really should be using a two-tailed test; if there is no way you would ever believe the mean went down, then a one-tailed test is probably fine, and you can just let the old grumps (like me) harrumph about it. What you should not do is run the test both ways and pick the one that gives you the result you like best (or run a one-tailed test, notice that the mean went down by a lot, then run a two-tailed test and call it 'significant').
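
For what it is worth, most software makes the choice explicit rather than something you back into. In SciPy, for example, the `alternative` argument selects the tail (continuing with the invented scores from earlier):

```python
from scipy import stats

two_sided = stats.ttest_rel(time2, time1)                         # default
one_sided = stats.ttest_rel(time2, time1, alternative='greater')  # time2 > time1
print("two-tailed p:", two_sided.pvalue)
print("one-tailed p:", one_sided.pvalue)
# For a positive t, the one-tailed p is half the two-tailed p,
# which is exactly why choosing the tail after seeing the data is cheating.
```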
