Solved – Comparison of average values of data sets

density function, distributions, mean, subset

I am working with two data sets of unequal sample sizes, 998 and 857.
The average of the first data set (998 samples) comes out higher than that of the second.
To my surprise, when I split each complete data set into unequal halves (the first into subsets of 803 and 195 samples, the second into subsets of 819 and 38 samples),
comparing the average of the 803-sample subset of the first data set with the average of the 819-sample subset of the second showed the reverse trend in their means.
The same reversal was observed for the other pair of subsets.

My question is: is it possible that, if the mean of all items in A is greater than the mean of all items in B, their subsets show the reverse trend in their means, i.e. the mean of $A_1$ is less than the mean of $B_1$ and the mean of $A_2$ is less than the mean of $B_2$?

Is this because of the unequal sample sizes, because of the shape of the sample distributions, or both?

If this is possible, is there a way to explain it quantitatively?

It would be really helpful if anyone could help me with this.

Best Answer

The answer is "Yes". This is Simpson's paradox applied to mean differences instead of odds ratios. You can read the Wikipedia article (http://en.wikipedia.org/wiki/Simpson%27s_paradox) to understand the mechanisms behind it. It is a projection problem: if you only see a two-dimensional projection of a three-dimensional object, you can get quite a wrong impression of the whole picture. In balanced settings (equal group sizes), this is not possible.
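Quantitatively, the reversal comes from the fact that each overall mean is a size-weighted average of its subgroup means (a standard identity, not anything specific to your data):

$$\bar{x}_A = \frac{n_{A_1}\,\bar{x}_{A_1} + n_{A_2}\,\bar{x}_{A_2}}{n_{A_1} + n_{A_2}}, \qquad \bar{x}_B = \frac{n_{B_1}\,\bar{x}_{B_1} + n_{B_2}\,\bar{x}_{B_2}}{n_{B_1} + n_{B_2}}.$$

If the subgroup sizes $n_{A_i}$ and $n_{B_i}$ differ a lot, the two overall means put very different weights on their subgroups, so $\bar{x}_A > \bar{x}_B$ can hold even though $\bar{x}_{A_1} < \bar{x}_{B_1}$ and $\bar{x}_{A_2} < \bar{x}_{B_2}$ (and vice versa). With equal group sizes the weights coincide and the reversal cannot occur.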

Consider, for instance, the following simple setting:

  • $A_1$ consists of 99 times the value 1
  • $A_2$ consists of the value 100
  • $B_1$ consists of the value -9
  • $B_2$ consists of the value 99

The average of $A = A_1 \cup A_2$ is about 2 and thus much smaller than the average 45 of $B = B_1 \cup B_2$. On the other hand, the average 1 of $A_1$ is larger than the average -9 of $B_1$. Similarly, the average 100 of $A_2$ is larger than the average 99 of $B_2$.
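A quick numerical check of this example, in plain Python (the names $A_1$, $A_2$, $B_1$, $B_2$ match the bullets above):

```python
# Simpson's paradox for means: each A-subset beats the matching B-subset,
# yet B wins overall, because the subgroup sizes weight the overall means
# very differently.

def mean(xs):
    return sum(xs) / len(xs)

A1 = [1] * 99   # 99 copies of the value 1
A2 = [100]      # the single value 100
B1 = [-9]       # the single value -9
B2 = [99]       # the single value 99

A = A1 + A2     # 100 values in total
B = B1 + B2     # 2 values in total

print(mean(A1), mean(B1))  # 1.0 > -9.0: A1 beats B1
print(mean(A2), mean(B2))  # 100.0 > 99.0: A2 beats B2
print(mean(A), mean(B))    # 1.99 < 45.0: yet A loses to B overall
```

The overall mean of $A$ is dominated by the 99 values of 1, while the overall mean of $B$ weights its two values equally, which is exactly what drives the reversal.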