The sample size applied to a non-normal distribution

Tags: distributions, normal distribution, r, random variable, sampling

I have a single variable that represents my population values (sample of data):

[1]  94.51  59.81  63.84  94.51  94.51  94.51  94.51  94.51  94.51  94.51
[11]  59.81  94.51  94.51  94.51  47.90  29.16  50.36  23.51  44.41  33.14
[21]  47.90  29.16  47.90  29.16  47.90  29.16  47.90  29.16  47.90  29.16
...
[331]  23.44  24.52  12.37  29.12  24.52  12.37  29.12  24.52  12.37  29.12
[341]  24.52  12.37  29.12  24.52  12.37  29.12  24.52  12.37  29.12  24.52
[351]  12.37  29.12  24.52  12.37  45.25  25.78  49.84  29.12  24.52  12.37
[361]  29.12  24.52  12.37  29.12  24.52  12.37


> summary(group$V1)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
6.11   35.94   59.13   62.31   86.10  111.50 
> mean(group$V1)
 [1] 62.30546
> sd(group$V1)
 [1] 29.55491

The corresponding histogram is:
[Figure: Histogram of BitScores of the Population]
And the Shapiro test of normality:

Shapiro-Wilk normality test


    data:  group$V1
    W = 0.9466, p-value = 3.161e-10

Given this result, my conclusion is that the population is not normally distributed. The objective is to extract a sample from this population, but I am having trouble choosing a method to determine the sample size, because some methods assume that the population is normal (according to this reference).
The sample is needed to compare this group with a random group of the same sample size; the single variable to evaluate is the Bitscore.

Any references, suggestions, or approaches?
Thanks in advance.

Best Answer

I'm going to start by accepting your claim that you have the population, but I'll come back to this issue at the end.

1) If you actually have the target population, then hypothesis tests - which are based on assuming you have samples, not populations - are pointless. You can answer such questions by inspection. If that's the population about which you wish to make inferences, it's plainly not normally distributed. The p-value is irrelevant.

2) Before worrying about whether your population is normal, first worry about whether you do actually need that assumption for something ... and then work out how much of an issue non-normality might be. So which particular things do you need to assume normality to use, and how critical is some degree of non-normality to their results?

3) For this kind of purpose, hypothesis tests of distribution shape don't really answer the right question in any case. e.g.1, e.g.2
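
As a rough illustration of point 2, one way to gauge how much non-normality matters for a given procedure is to simulate it. The sketch below assumes the procedure of interest is a one-sample t-test at a nominal 5% level, which is my assumption and not something stated in the question. The `population` vector is a made-up stand-in for `group$V1`, and `type1_rate` is a hypothetical helper name.

```r
# Hypothetical stand-in for the 366 values in group$V1; substitute the real vector.
set.seed(1)
population <- c(runif(200, 6, 60), runif(166, 60, 112))

# Estimate how often a nominal 5% one-sample t-test rejects when the null
# hypothesis is actually true (mu is set to the mean of the values sampled from).
type1_rate <- function(pop, n, reps = 2000, alpha = 0.05) {
  mu <- mean(pop)
  rejections <- replicate(reps, {
    # Sampling with replacement treats the values as the distribution itself.
    x <- sample(pop, n, replace = TRUE)
    t.test(x, mu = mu)$p.value < alpha
  })
  mean(rejections)
}

sizes <- c(10, 30, 100)
rates <- sapply(sizes, function(n) type1_rate(population, n))
names(rates) <- sizes
print(rates)
```

If the estimated rates sit close to 0.05 at the sample sizes you are considering, non-normality is doing little harm to the significance level of that particular test. Other procedures, and other properties such as power, would need their own check.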

Now, to try to address your underlying question, which relates to determining sample sizes for hypothesis tests.

a) You say you have the population. Why do you need hypothesis tests at all? Just look at the population. Want to see if some mean value differs from some hypothesized value? You have the population mean already, so just look at the number! Is it the same number as the hypothesized value or not?

b) Let's say there is some reason to do a hypothesis test when you have the population. You can just simulate samples from your population (by drawing randomly from the population of values) in order to find the minimum sample size with the required characteristics. But since the simulations would actually be the samples, your question would already be answered by then choosing one of your simulated samples at random and labelling it 'My Sample'. [Quite why one would be interested in such a performance is beyond me, but when you have the population, that is drawing a sample.]
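
The simulation described in (b) might be sketched as follows. Everything specific here is an assumption for illustration: the one-sample t-test, the hypothesized value `mu0` (5 units below the mean), the 80% power target, and the grid of candidate sizes. The `population` vector is again a made-up stand-in for `group$V1`, and `power_at` is a hypothetical helper name.

```r
# Hypothetical stand-in for the 366 values in group$V1; substitute the real vector.
set.seed(2)
population <- c(runif(200, 6, 60), runif(166, 60, 112))

mu0          <- mean(population) - 5  # hypothesized value (assumption)
target_power <- 0.80                  # required power (assumption)

# Proportion of simulated samples of size n in which the test rejects mu = mu0.
power_at <- function(pop, n, mu0, reps = 1000, alpha = 0.05) {
  mean(replicate(reps, {
    # With replacement, so n is allowed to exceed length(pop).
    x <- sample(pop, n, replace = TRUE)
    t.test(x, mu = mu0)$p.value < alpha
  }))
}

candidate_n <- seq(50, 400, by = 50)
power <- sapply(candidate_n, function(n) power_at(population, n, mu0))

# Smallest candidate size whose estimated power reaches the target
# (NA if none of the candidates does).
min_n <- candidate_n[which(power >= target_power)[1]]
print(data.frame(n = candidate_n, power = power))
print(min_n)
```

The estimated power is itself noisy, so the chosen size can shift by one grid step between runs; increasing `reps` or using a finer grid near the threshold tightens it.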

At the end it sounds like you want to compare a particular subgroup with the population as a whole on a particular variable, but you don't say what you want to compare about them - means? some general notion of location? spread? distributional shape?

Why would you need the groups to be the same size?
You say you have the population. The subgroup is therefore the population of that subgroup. Whatever you want to compare, you just compare the numbers and see if they're the same. (Of course, they won't be - you know this before you start. This is a dumb exercise, because you're trying to answer a question you already know the correct answer to.)
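
If these really are populations, "comparing the numbers" amounts to something like the sketch below. The `scores` vector and the `in_subgroup` indicator are made-up placeholders (the question does not say how the subgroup is defined), and `describe` is a hypothetical helper.

```r
# Hypothetical stand-ins; substitute group$V1 and the real subgroup membership.
set.seed(3)
scores      <- c(runif(200, 6, 60), runif(166, 60, 112))
in_subgroup <- rep(c(TRUE, FALSE), times = c(60, 306))

# A few summaries covering location and spread.
describe <- function(x) {
  c(n = length(x), mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
}

comparison <- rbind(
  population = describe(scores),
  subgroup   = describe(scores[in_subgroup])
)
print(round(comparison, 2))
```

Which rows of that table matter depends on what you actually want to compare, which is the question raised above.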

[Finally, I'm going to make a little bet. I bet you don't actually have the population about which you wish to make inferences. I bet you wish to extend your inference outside of those 366 values to something broader - your actual target population. This is no doubt part of the reason why you retain some urge to do hypothesis tests.]
