Solved – How to test for statistical differences between sets of skewed data

data transformation, probability, statistical significance

I have two sets of probability values (one with 100 rows and another one with 500 rows). The data is skewed.

I want to test if there are significant differences between both groups.

Questions:

1) What is the best approach/statistical test to do so?

2) Should I reclassify the values into classes (for instance: 0-0.1, 0.1-0.2, …, 0.9-1), so that both sets have the same number of rows when I compare them?

3) Is it necessary to transform the data?

Best Answer

The information that the data are skewed is too vague to allow very precise advice.

Regardless of testing or modelling, you should plot the data. A good plot to compare the overall characteristics of two distributions is a quantile-quantile plot.
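As a rough illustration of the quantile-quantile idea with unequal sample sizes, the sketch below evaluates both samples at the same set of probabilities and pairs up the resulting quantiles. The samples, the helper name `qq_pairs`, and the choice of 19 cut points are assumptions made for the example, not part of the question.

```python
import random
from statistics import quantiles

def qq_pairs(a, b, n=20):
    """Pair the quantiles of a and b at the same n - 1 evenly spaced probabilities."""
    qa = quantiles(a, n=n, method="inclusive")
    qb = quantiles(b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical skewed samples on (0, 1), standing in for the real data.
rng = random.Random(0)
group_a = [rng.betavariate(2, 8) for _ in range(100)]
group_b = [rng.betavariate(2, 5) for _ in range(500)]

pairs = qq_pairs(group_a, group_b)
for qa, qb in pairs:
    # Points near the line y = x suggest similar distributions.
    print(f"{qa:.3f}  {qb:.3f}")
```

Plotting the second column against the first, with the line of equality for reference, gives the quantile-quantile plot. Because both samples are reduced to the same probabilities, the 100 versus 500 rows are not an obstacle.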

What you say seems consistent with several quite different approaches.

  1. The data are skewed but comparison of means still makes sense. A t-test will often work quite well in this situation, but watch out.

  2. The data are skewed and the most useful comparison may be to use a Wilcoxon-Mann-Whitney test.

  3. The data are skewed and are better analysed on a transformed (e.g. logarithmic) scale. That need not entail transformation, as a generalised linear model could be used directly.
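The first two approaches can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the samples are simulated, the function names are made up for the example, the t-test shown is the Welch (unequal variance) variant, and the Wilcoxon-Mann-Whitney p-value uses the large-sample normal approximation with a tie correction and no continuity correction. In practice a statistics library (for instance `scipy.stats.ttest_ind` with `equal_var=False`, or `scipy.stats.mannwhitneyu`) would usually be preferred.

```python
import math
import random
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def mann_whitney_u(a, b):
    """U statistic for sample a and a two-sided normal-approximation p-value."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        size = j - i
        tie_term += size ** 3 - size
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p

# Hypothetical skewed samples of the sizes mentioned in the question.
rng = random.Random(1)
group_a = [rng.betavariate(2, 8) for _ in range(100)]
group_b = [rng.betavariate(2, 5) for _ in range(500)]

t, df = welch_t(group_a, group_b)
u, p = mann_whitney_u(group_a, group_b)
print(f"Welch t = {t:.2f} on about {df:.0f} degrees of freedom")
print(f"Mann-Whitney U = {u:.0f}, approximate two-sided p = {p:.3g}")
```

Neither calculation needs the two groups to have the same number of rows. Running both on the same data is one way to see whether the conclusion depends on the choice between comparing means and comparing ranks.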

Tastes and styles differ; my own view is that it may not be wrong to apply two or three of these, not least to answer dogmatists who may be convinced that there is one correct way to proceed.

If you post the data and explain what they are (which sometimes makes a difference to what makes most scientific sense) detailed advice is more likely. An important detail in any case is whether negative or zero values occur in practice or are possible in principle.

You refer to probability values, which itself suggests values bounded by 0 and 1. If they really are probabilities, some other scale may make more sense (e.g. logit).
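For reference, a minimal sketch of the logit scale follows. The example values are invented, and the decision to raise an error at exactly 0 or 1 is an assumption made for the sketch; how to handle such values depends on what the data are.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined at 0 and 1")
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical probability values.
values = [0.02, 0.10, 0.50, 0.90]
transformed = [logit(p) for p in values]
print(transformed)
```

The transform stretches values near 0 and 1 and is symmetric about 0.5, which is why exact zeros or ones in the data matter: they have no finite logit.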

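Note: On one point I am dogmatic: I can see no reason at all for your #2, binning of data. That just throws away detail arbitrarily and for no good reason. Nor is it needed to equalise the number of rows: none of the approaches above requires equal sample sizes.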
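A small made-up example shows the kind of detail that binning discards. The two samples below are hypothetical, as is the helper `bin_counts`; they have clearly different means yet identical counts in classes of width 0.1.

```python
from collections import Counter
from statistics import mean

def bin_counts(values, width=0.1):
    """Count values per class of the given width on [0, 1]."""
    n_bins = round(1 / width)
    counts = Counter(min(int(v / width), n_bins - 1) for v in values)
    return [counts.get(k, 0) for k in range(n_bins)]

# Hypothetical samples: same classes, different positions within them.
low = [0.11, 0.12, 0.13, 0.21]
high = [0.17, 0.18, 0.19, 0.29]

print(bin_counts(low) == bin_counts(high))  # the binned data cannot tell them apart
print(mean(low), mean(high))                # the raw values can
```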
