Solved – Population with known distribution (not normal) versus small sample

bootstrap, probability, statistical significance

I have a complete population consisting of 6587 individuals (genes) and for each individual I have a distance value. These distance values are not normally distributed.

I'm interested in determining if a sample of 348 individuals (which has a smaller mean distance value compared with the mean distance value from the complete population) is significantly different than expected.

I have limited statistics background and would appreciate advice on what statistical test to use.

Ideas include:

  1. two-sample t-test (probably not right, since it's not a normal distribution)
  2. Wilcoxon rank-sum test (a buddy suggested this)
  3. some other test that's easy to implement in R
  4. custom test – one idea I had was to randomly pick out 348 individuals (with replacement) from the population and determine their mean distance, then repeat this 1000 times (I think this is called bootstrapping). I did this, and I get a normal distribution of means with mean 32000 (same as the population mean) and sd = 2000. The mean distance value for the actual sample of 348 individuals was 18000. So, using a one-tailed p-value from a normal distribution with mean 32000 and sd 2000, this appears significant.
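The arithmetic in idea 4 can be checked with a few lines of Python. This is only a sketch of that one calculation, using the three numbers quoted above (32000, 2000 and 18000); the variable names are made up for illustration, and it takes the normal approximation to the resampled means at face value.

```python
import math

# Numbers quoted in idea 4 of the question.
resampled_mean = 32000.0   # centre of the 1000 resampled means
resampled_sd = 2000.0      # spread (sd) of the resampled means
observed_mean = 18000.0    # mean distance of the actual 348-gene sample

# How many standard deviations the observed mean lies from the centre.
z = (observed_mean - resampled_mean) / resampled_sd

# Lower-tail probability of a standard normal, P(Z <= z).
p_lower = 0.5 * math.erfc(-z / math.sqrt(2.0))

print(f"z = {z:.1f}, one-tailed p = {p_lower:.2e}")
```

The observed mean sits 7 standard deviations below the centre of the resampled means, so the one-tailed p-value is on the order of 1e-12.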

Thanks in advance for any advice!

Best Answer

I'm going to take you at your word that the population really is a population in the right sense.

"I'm more interested in comparing the mean of the smaller sample with the mean of the total population."

If you know the distribution of the total population, you know its mean. So you're in the position of trying to compare a sample mean to a population mean. Your sample is of size 348. This is a one-sample test, not a two-sample test.

Points (corresponding to your numbering):

1) The CLT should give you approximate normality of the sample mean unless the distribution is pretty extreme (heavily skewed, for example); you can check the approximate normality of the distribution of the sample mean in several ways. By the same token, with $n = 348$ you should be able to ignore the uncertainty in the standard deviation. You don't need to worry about whether the one-sample t-statistic has a t-distribution, because a one-sample z-statistic (basically the same quantity, but taking $s$ to be $\sigma$) should be very close to standard normal.

2) The relevant rank test is not a Wilcoxon rank-sum test, because you know the population. The one-sample equivalent is a Wilcoxon signed-rank test, but that assumes symmetry; it also doesn't test the mean, but a slightly different measure of location. If you don't have near symmetry, the obvious rank-based one-sample test would be the sign test, which also doesn't test the mean. If you consider only shift alternatives (that is, you assume the only difference between sample and population is a possible shift), you could test against the population value of whatever location measure each test targets; if those differ, then by direct implication the means differ by the same amount. In the case of the sign test, that would mean testing against the population median in order to also test for a difference in means. However, I suspect that the required assumption of identical shapes apart from a location shift is unlikely to be tenable. Let me know if you think shift-only alternatives make sense.

4) Resampling approaches (either using randomization or bootstrapping) are possible.
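Points 1 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the data are synthetic (skewed lognormal values standing in for the real distances, with the sample deliberately shifted lower), and the function names `one_sample_z` and `sign_test_lower` are invented for this example. The z-test treats the population mean and standard deviation as known; the sign test compares the sample against the population median.

```python
import math
import random
import statistics

def one_sample_z(sample, population):
    """One-sample z-test of the sample mean against the known population mean."""
    mu = statistics.fmean(population)
    sigma = statistics.pstdev(population)      # population sd, treated as known
    z = (statistics.fmean(sample) - mu) / (sigma / math.sqrt(len(sample)))
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2.0))   # P(Z <= z)
    return z, p_lower

def sign_test_lower(sample, population):
    """One-sided sign test: is the sample located below the population median?"""
    med = statistics.median(population)
    above = sum(x > med for x in sample)
    below = sum(x < med for x in sample)
    n = above + below                          # values tied with the median are dropped
    # P(X <= above) for X ~ Binomial(n, 1/2), computed exactly with integers.
    p_lower = sum(math.comb(n, k) for k in range(above + 1)) / 2 ** n
    return above, n, p_lower

# Synthetic stand-in data (an assumption; NOT the real gene distances).
random.seed(0)
population = [random.lognormvariate(10.0, 1.0) for _ in range(6587)]
sample = [random.lognormvariate(9.5, 1.0) for _ in range(348)]

z, p_z = one_sample_z(sample, population)
above, n_used, p_sign = sign_test_lower(sample, population)
print(f"z-test:    z = {z:.2f}, one-sided p = {p_z:.3g}")
print(f"sign test: {above} of {n_used} above the median, one-sided p = {p_sign:.3g}")
```

As noted in point 2, the sign test only speaks to the mean if a pure location shift is a reasonable model for the difference between sample and population.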

Let's discuss some randomization tests:

a) If you could assume symmetry, a test immediately suggests itself, based on the same randomization idea that the Wilcoxon signed-rank test uses, but applied to the data rather than the ranks. That is, you subtract the population mean from the sample values, count how many + and - signs there are, and randomly reassign the signs to those differences to get the randomization distribution.

b) You could randomly sample 348 values from the whole (6587+348) combined sample-plus-other-population pool and calculate the distribution of means, comparing your sample mean with that distribution (I'd lean toward this).
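Both randomization tests can be sketched in a few lines of Python. This is a rough sketch, not a definitive implementation: the data are synthetic lognormal values (an assumption, standing in for the real distances), the function names and the choice of 2000 resamples are made up for illustration, and the p-values use the common (count + 1) / (resamples + 1) convention. Because the synthetic data are skewed, test (a) is shown only for its mechanics; its symmetry assumption does not hold here.

```python
import random
import statistics

def sign_flip_test(sample, mu, n_resamples=2000, seed=1):
    """(a) Randomly reassign signs to (x - mu); assumes symmetry about mu under H0."""
    rng = random.Random(seed)
    diffs = [x - mu for x in sample]
    observed = statistics.fmean(diffs)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if statistics.fmean(flipped) <= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)      # one-sided (lower-tail) p-value

def pooled_resample_test(sample, population, n_resamples=2000, seed=2):
    """(b) Draw len(sample) values without replacement from the combined pool."""
    rng = random.Random(seed)
    pool = list(population) + list(sample)
    observed = statistics.fmean(sample)
    hits = 0
    for _ in range(n_resamples):
        if statistics.fmean(rng.sample(pool, len(sample))) <= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)      # one-sided (lower-tail) p-value

# Synthetic stand-in data (an assumption; NOT the real gene distances).
random.seed(0)
population = [random.lognormvariate(10.0, 1.0) for _ in range(6587)]
sample = [random.lognormvariate(9.5, 1.0) for _ in range(348)]

p_flip = sign_flip_test(sample, statistics.fmean(population))
p_pool = pooled_resample_test(sample, population)
print(f"(a) sign-flip p = {p_flip:.4f}")
print(f"(b) pooled resampling p = {p_pool:.4f}")
```

With this convention the smallest attainable p-value is 1 / (resamples + 1), so more resamples are needed to resolve very small p-values.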
