Solved – Skew in both directions and dealing with outliers

anova, outliers, skewness

I have a lot of wonderfully messy data (got to love the social sciences), and realized that I was not fully prepared to bear its wrath.

For the record, after reading some articles regarding ANOVA, I am extremely paranoid about violating its assumptions. So I am trying to do my best in cleaning the data. In instances where assumptions have been violated, I have also run a nonparametric counterpart to assess whether the results are similar. I'm not entirely sure whether this is an acceptable practice? Or, if so, how this can be reported in my results.

A bit of background: my design is set up for a 2 x 5 mixed factorial (the between-groups factor is age – young/old – and the within-groups factor is different forms of input methods).

Questions:

  1. In looking at my data (especially preference data) for the conditions, most of the data is skewed. The problem is that most of the distributions are positively skewed, while one or two are negatively skewed. Not quite sure what to do with this? Everything I have read has suggested that ANOVA is not a viable option under this circumstance. And some have also suggested that nonparametric tests are also not very powerful in this case. What to do?

  2. I have had to transform my data to help with normality and variance issues, but still have outliers in some cases. (I have looked at the outliers and reasoned through their existence. Some people just performed very well/poorly or possibly were harsher raters than the average Joe.) I was just taking the extreme scores that fell above/below my threshold (defined below) and recoding them as one more (or less) than the next most extreme score. My questions are: is it OK to recode outliers after the transformation? If so, what's the best way to do that, say, after a square root transformation (obviously making it one more or one less than the next most extreme score is not the way to go as with untransformed data)?

  3. Originally, I identified outliers (in addition to using normality plots and boxplots) as scores more than +/- 2.24 SD from the mean. After doing some reading, I think that detecting outliers using the median absolute deviation might be best for my distributions. Just to be clear, though: after finding the median of the absolute deviations of the scores from the median for that condition, do you just multiply that by 3 and flag any score farther than that from the median as an outlier? (A rough sketch of the rule I mean is below.)
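
In code terms (made-up numbers, not my real data), the rule I mean is:

    import numpy as np

    # Made-up scores for one condition, just to illustrate the rule.
    scores = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 48], dtype=float)

    med = np.median(scores)              # median for the condition
    abs_dev = np.abs(scores - med)       # absolute deviations from the median
    mad = np.median(abs_dev)             # median absolute deviation (MAD)

    # Flag scores whose distance from the median exceeds 3 * MAD.
    # (Some authors first rescale the MAD by 1.4826 so it estimates the SD
    # under normality; the cutoff changes accordingly.)
    outliers = scores[abs_dev > 3 * mad]
    print(med, mad, outliers)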

Sorry, these are probably basic questions. I appreciate any feedback/advice! Thank you!

EDIT: My dependent variables are different. Most of them are continuous (performance in WPM, and also Likert-type response scales). Another type of variable is error rates (percentages).

EDIT #2: My design is 2 (between groups – age) x 5 (within groups – input method type). There are 25 people in each group (young/old) so there are 25 people in each cell in my design. It was counterbalanced for order effects.
To be more specific, my dependent variables are:
1. Adjusted Words per Minute (this is an average for each person over 15 trials) for each of 5 input methods. Along these lines, I also captured 3 different error rates (also averaged over 15 trials for each person).
2. Subjective ratings for each of the input methods: satisfaction (using the System Usability Scale, typical in our field; a continuous composite score derived from 10 questions), workload (using the NASA-Task Load Index, again continuous), pre- and post-test preference (0-50 point scale), and post-test perceptions of accuracy (same 0-50 point scale). Also the Attitudes Toward Computers Questionnaire to assess attitudes toward technology (composites).
3. Individual measures (which I plan to correlate with some of the preference/performance measures). Individual baseline measures are all continuous: motor speed (finger tapping test), speech rate, and cognitive tests (Digit Symbol Substitution, etc.). Also some anthropometric measurements (finger circumference, hand breadth, etc.). Many of those are just going to be descriptives about my sample.

Best Answer

For the record, after reading some articles regarding ANOVA, I am extremely paranoid about violating its assumptions

Which articles? What do they say?

So I am trying to do my best in cleaning the data

What activities do you encompass when you say 'cleaning the data'?

In instances where assumptions have been violated, I have also run a nonparametric counterpart to assess whether the results are similar. I'm not entirely sure whether this is an acceptable practice?

Well, to me it seems wise - but you should work out how you will act under the various outcomes. What if they give different results? What do you do? (e.g. If you will pay attention to the nonparametric test when they differ, why use a parametric test at all?)

What nonparametric procedure did you have in mind?

Or, if so, how this can be reported in my results.

Perhaps it's best to work out how you will use it, and then consider how that will affect things like significance (and maybe power) before you do it, so that you can choose the approach without already having impacted your results.

In looking at my data (especially preference data) for the conditions, most of the data is skewed.

Which assumption of ANOVA do you think is violated?

most of the distributions are positively skewed, while one or two are negatively skewed.

Sounds like maybe your response has a strongly bounded range - can you say more about what your response is measuring and how you measure it?

Everything I have read has suggested that ANOVA is not a viable option under this circumstance.

What have you read, and what does it say?

Let's imagine there was an assumption that this violated. Do you know what the consequences of the violation would be? Can you live with those consequences?

And some have also suggested that nonparametric tests are also not very powerful in this case.

Generally, nonparametric approaches have assumptions too. What are you using?

I have had to transform my data to help with normality and variance issues, but still have outliers in some cases.

What normality issues? Why is transformation the right approach?

This is the first mention of variance issues. Can you describe them? What assumption are you violating and why do you think it was violated?

I was just taking the extreme scores that fell above/below my threshold (defined below) and recoding them as one more (or less) than the next most extreme score.

Why do this? Changing your numbers in an arbitrary fashion sounds like a bad idea to me.

How do you know that the impact on your inference isn't far worse than not doing this?

(obviously making it one more or one less than the next most extreme score is not the way to go as with untransformed data)

Who ever told you to do this?

--

It sounds like you've been spooked by some articles, and that hasn't led you toward doing things that make much sense to me. The actions you've chosen to take as a result seem, frankly, potentially much worse than doing nothing.

The very first thing to understand is that normality of your collection of data is not an assumption of ANOVA at any stage.

What is assumed to be normal are the error terms in your model (which you estimate by residuals).

In any case, non-normality doesn't mean that your estimates of effects are wrong; the only impact is on your inference (it can affect significance levels and power). It may be better to try to quantify that impact than to take drastic actions whose own effect on every aspect of your inference is unknown, except that, at the very least, your estimates of effects will then be biased.

You may be better off keeping the data you have - skewed and with outliers - and finding ways of working with those features directly.

The first thing is probably to fit a model and see how the residuals behave, exactly as whuber suggested; it may be that your assumptions aren't unreasonable at all.
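
For instance, with the data in long format (one row per subject by input method), something like the following would do as a first pass - a minimal sketch in Python/statsmodels, where the column names 'score', 'age_group', 'method' and 'subject' are placeholders and a subject-level random intercept is only a rough stand-in for the full repeated-measures structure:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import matplotlib.pyplot as plt

    # Long format: one row per subject x input method (placeholder file name).
    df = pd.read_csv("scores_long.csv")

    # Random intercept per subject as a rough stand-in for repeated measures.
    model = smf.mixedlm("score ~ age_group * method", data=df, groups=df["subject"])
    fit = model.fit()
    print(fit.summary())

    # Check the residuals, not the raw responses.
    sm.qqplot(fit.resid, line="s")
    plt.title("Q-Q plot of residuals")
    plt.show()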

If that doesn't work out, a nonparametric procedure might be better - depending on its own assumptions. One possibility might be to do some form of bootstrapping, or other resampling procedure.
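
As a very rough illustration of the resampling idea (made-up numbers; with your design you would resample subjects, keeping each subject's five conditions together, rather than resampling individual scores as done here):

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up per-subject scores for the two age groups.
    young = rng.normal(40, 8, size=25)
    old = rng.normal(35, 10, size=25)

    observed_diff = young.mean() - old.mean()

    # Bootstrap the difference in group means by resampling subjects
    # with replacement within each group.
    boot_diffs = np.array([
        rng.choice(young, size=young.size, replace=True).mean()
        - rng.choice(old, size=old.size, replace=True).mean()
        for _ in range(10_000)
    ])

    # Percentile bootstrap confidence interval for the difference.
    ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
    print(observed_diff, ci_low, ci_high)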

Or perhaps some form of GLM would be suitable, in some circumstances (depends on what sort of data you collected).

Edit:

If your 'preferences' are counts, this may in fact be the best approach from the start; it would also be a potential explanation for the changing skewness (that pattern would actually be expected under a binomial model). A rough sketch follows after this edit.

(end edit)
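
To make that concrete: if a preference score really is something like x points out of a possible 50, a binomial GLM could be sketched as below (Python/statsmodels; column names are placeholders, and this simple version ignores the within-subject correlation, which a mixed or clustered model would need to handle):

    import numpy as np
    import pandas as pd
    import patsy
    import statsmodels.api as sm

    # Placeholder columns: pref (0-50), age_group, method.
    df = pd.read_csv("preferences_long.csv")

    # Treat each preference score as "pref successes out of 50 points".
    endog = np.column_stack([df["pref"], 50 - df["pref"]])
    exog = patsy.dmatrix("age_group * method", data=df, return_type="dataframe")

    fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
    print(fit.summary())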

Or perhaps some transformation that's more in keeping with whatever form of data you have - but you have to be careful with transformations, because the interpretation of effects and so on can become tricky.

--

Edit 2:

"Articles" is vague, I apologize. In general, I've been reading articles and text books about ANOVA and some of its nonparametric counterparts (e.g., the Kirk Experimental Design book, a quantitative "best practices" book by Osborne, a paper by Wilcox, etc.).

The only one of those I've heard of is Wilcox (assuming it's either Rand or Paul). Which paper?

What does each of these people say that you see as relevant? (One or two quotes if possible - or at least paraphrases - of whatever led you to do what you did would be handy.)

For "cleaning", I have checked for errors in data entry, outliers (the reason they are there and whether they appear to have marked influence on the mean).

Okay. How do you tell if something is an outlier in the absence of a fitted model?

I have done Friedman's tests for the overall within comparisons (and Wilcoxon tests for paired comparisons).

Let me double check your use of the word 'mixed' in this earlier phrasing:

my design is set up for a 2 x 5 mixed factorial (the between-groups factor is age - young/old - and the within-groups factor is different forms of input methods)

I take it you mean it's both 'between' and 'within' (repeated measures)?

I cannot find a nonparametric alternative to a 2 x 5 mixed factorial. If you can recommend one, I'd be appreciative!

Well, as this suggests - and as you might have anticipated - you could use a Friedman-type approach for some of the differences, but it doesn't handle the entire design.

I am reasonably confident there are some tests that might work here, but let me come back to you on this if it becomes necessary.

So far, I've violated normality (which I have read is least important out of the assumptions), sphericity (but have used the corrections), and homogeneity (which is what sparked my interest in using nonparametrics, I have a variance ratio of 5:1 for one condition).

Nonparametric approaches usually assume the same distribution aside from a possible location shift - few of them are meant to deal with both differences in mean and differences in scale at the same time.

When you say 'in one condition' what are you actually calculating the variance of? Is it residual variances from your full model? What are the sample sizes in the two variance estimates?

The reason I ask is if you're doing it on anything but residuals you're confounding variance with difference-in-mean effects.
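
A quick sketch of the difference (placeholder column names):

    import pandas as pd

    # Placeholder columns: score, age_group, method.
    df = pd.read_csv("scores_long.csv")

    # Residual = score minus its cell (age_group x method) mean.
    cell_means = df.groupby(["age_group", "method"])["score"].transform("mean")
    df["resid"] = df["score"] - cell_means

    # Raw variances per condition pool over the two age groups, so any
    # difference in group means inflates them; residual variances do not.
    print(df.groupby("method")["score"].var())
    print(df.groupby("method")["resid"].var())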

I know that violating some of these assumptions may lead to inflated alpha levels and reduced power. This is why I was "double checking" with nonparametric tests.

Can you describe in specific detail what tests you did?

So far, my effect sizes are large and my p values are small on conditions that I would have expected to see differences. So I think I can live with these violations?

It depends! We may be able to show that the impact of them on your significance levels should not be large.

It's just hard to know what effect they are truly having on my results.

That's an important issue, yes.

The recoding data to +/-1 the next most extreme score was recommended as an option in Tabachnick and Fidell's textbook.

Hmm. I haven't seen that book in a long, long time. I see it's been through a bunch of editions since I last saw it. They tell you to change your data values in this way when you have outliers in this circumstance?

I have 25 people in each of my two groups. I want to keep their data, but make their extreme scores a little less influential on the rest.

Extreme scores are not necessarily a problem; it depends on what makes them 'extreme'. I still can't tell for sure that your assumptions are actually violated.

With this sort of sample size you probably won't even be able to reliably tell if they're violated.

I have never done bootstrapping before.

While it's not hard, and might solve your problems (though I am not fully convinced yet that you actually have any), I'll try to avoid suggesting it if there's a reasonable alternative.

You may be right about not trying to alter my data too much. I don't like doing it. Mainly the transformations are for handling variance discrepancies.

How exactly are you assessing differences in variance?

Thank you so much for the quick responses. Honestly, I hadn't looked at my residuals. But now that I have, I'm not entirely sure how well they "fit".

I don't know what this means. Can you elaborate?

Are the Q-Q plots the best tool to use to check?

That's what I'd use, yes. How many subjects are in each of your input methods?

I checked the residuals through Explore and the skewness and kurtosis values are not too bad (I think).

Again, though, there is one variable with a skewness of 0.3 and another of -0.7. Is this a problem?

I don't see that it should necessarily be. You might easily see that kind of variation with perfectly normal data. How big are the sub-samples on which these are calculated?
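
For what it's worth, a quick simulation shows how much the sample skewness of genuinely normal data bounces around - a sketch, assuming sub-samples of around 25:

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(1)

    # Sample skewness of truly normal samples of size 25, repeated many times.
    skews = np.array([skew(rng.normal(size=25)) for _ in range(10_000)])

    # The central 95% range is roughly +/- 0.9, so values like 0.3 or -0.7
    # would be unremarkable at that sample size.
    print(np.percentile(skews, [2.5, 97.5]))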

My dependent variables are different. Most of them are continuous (performance in WPM, and also Likert-type response scales). Another type of variable is error rates (percentages).

Wow. You have multiple responses and you didn't think that was worth mentioning?

1: What the heck is WPM? How is 'performance' measured?

Is this 'Words per minute'? I probably wouldn't expect that to be normal; indeed, as a rate, and moreover one likely to have variance related to the mean, I'd be inclined to look at this in a GLM, perhaps using a quasi-Poisson model.
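
A quasi-Poisson-style fit can be sketched in statsmodels by fitting a Poisson GLM and estimating the dispersion from the Pearson chi-square; the column names below are placeholders, and this simple version ignores the repeated measures:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Placeholder columns: wpm, age_group, method, subject.
    df = pd.read_csv("typing_long.csv")

    # Log-link Poisson-type model for a rate; scale="X2" estimates the
    # dispersion from the Pearson chi-square, i.e. quasi-Poisson standard
    # errors (variance proportional to, rather than equal to, the mean).
    model = smf.glm("wpm ~ age_group * method", data=df,
                    family=sm.families.Poisson())
    fit = model.fit(scale="X2")
    print(fit.summary())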

2: Likert scales aren't continuous. Is your response ordered categorical or something else? Are you constructing some overall score by adding results from several such scales?

These might bunch up at one end of the scale for some subjects and at the other end for others. Is this where you are getting those different skewnesses?

3: Error rates as percentages? You shouldn't expect these to satisfy normality. But you also shouldn't need them to, since you ought to be able to do some form of GLM-type model. These would be binomial unless your denominators are large and your rates are small, in which case you'd probably treat them as Poisson. (A sketch of one such model, fitted at the trial level, follows below.)

If this is done as a GLM, I have to think about how the design needs to be modelled (these kinds of designs aren't really my area; I'd be inclined to look at mixed models, but that's not necessarily easy with a GLM; I don't know if SPSS does mixed-model GLMs - indeed, it's many decades since I used SPSS).
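
One possibility, if trial-level data are available, is a generalized estimating equations (GEE) model with the subject as the cluster - a substitute for the mixed-model idea rather than the same thing. A minimal binomial sketch in Python/statsmodels, with placeholder column names (a 0/1 'error' per trial plus 'age_group', 'method' and 'subject'):

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Placeholder trial-level data: one row per trial.
    trials = pd.read_csv("trials_long.csv")

    # Binomial GEE with an exchangeable working correlation: trials from the
    # same person are allowed to be correlated, while the effects of age
    # group and input method are estimated on the logit scale.
    model = smf.gee("error ~ age_group * method", groups="subject",
                    data=trials, family=sm.families.Binomial(),
                    cov_struct=sm.cov_struct.Exchangeable())
    fit = model.fit()
    print(fit.summary())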

Akritas has written a bunch of papers on nonparametric tests, quite a few of them in JASA, a number of which look pretty directly relevant to your situation.

--

I'm not convinced yet that you have any problems at all.

We need to resolve whether:

1) you really have violation of your variance assumptions

(If your variance assumptions are really violated, many nonparametric methods are also likely to be affected.)

2) you really have substantial non-normality. Your sample size is so small you probably can't tell very well. (Which is potentially an argument against assuming it's okay.)

==

As my answer stands, it's a bit 'too localized' (that is, for the moment we're being too specific on details of your problem that won't generalize well enough for a CV-style answer). We'll need to clean up a bit later by modifying your question to include a lot of the details I am asking about, and then hopefully modify my answer into a better, more general CV-style answer that might be of value to someone else.
