Stratified sampling when there are too many strata

sampling, stratification

I am working on a sampling exercise that involves both simple random and stratified sampling. The sampling is for a series of experiments run on multiple populations. Some experiments run on the same population, either during non-overlapping time periods or at the same time.

When the tests don't overlap, I do simple random sampling. When they do overlap, I draw a simple random sample for the first experiment and perform stratified sampling for the second, to ensure there is no bias and to study the incremental value of the second experiment.

I am having trouble with the stratification when the number of treatment groups is large. This particular population has 99 people in it. The case I am currently facing is as follows:

Experiment 1: Control-50 people; Test-49 people
Experiment 2: Control-6%; 16 Test-groups adding up to 94%

For experiment 2, the 6% control group is made up of 6% of the control group and 6% of the test group from the first experiment, and likewise for each of the treatment groups. When the number of treatments is low this is not an issue, but in cases like the one above, with 16 treatments, I end up with unbalanced groups, as shown below.

The test-control split is:
Control: 6%, T1: 6%, T2: 6%, T3: 6%, ..., T15 and T16: 5% each, adding up to 100%.

Experiment 2: Control = 6% of Exp1.Control + 6% of Exp1.Test
              T1 = 6% of Exp1.Control + 6% of Exp1.Test
               .
               .
               .
              T15 = 5% of Exp1.Control + 5% of Exp1.Test

The expected split of people across the control and the 16 treatments is:

6-6-6-6-6-6-6-6-6-6-6-6-6-6-6-5-5

Even though the split for the first experiment is 50-50, there are only 99 people, so control gets 50 people and test gets 49, or vice versa.

But during stratification, 6% or 5% of 50 and 49 people results in decimals, so I am using the floor function. Because of that, I am getting this split instead of the expected one:

5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-5-20

Using ceiling instead of floor would most likely solve the problem in this case, but it will cause a problem if the number of groups is more than 17. At some point there won't be anyone left for the last group if I go with ceiling, and there will be a skewed split if I go with floor.
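The rounding effect described above can be reproduced with integer arithmetic. The following is only a sketch under stated assumptions: the function name `sequential_split` is hypothetical, the shares are taken as 6% for the control and T1 to T14 and 5% for T15 and T16, groups are filled in order within each Experiment 1 stratum, and the last group absorbs whatever is left.

```python
def sequential_split(stratum_sizes, shares_pct, rounding):
    """Total group sizes when every group takes a rounded share of each stratum.

    Assumption: groups are filled in order and the last group receives
    whatever remains in the stratum.
    """
    totals = [0] * len(shares_pct)
    last = len(shares_pct) - 1
    for n in stratum_sizes:
        remaining = n
        for g, pct in enumerate(shares_pct):
            if g == last:
                take = remaining
            else:
                hundredths = pct * n  # integer math avoids float rounding noise
                if rounding == "floor":
                    want = hundredths // 100
                else:  # ceiling
                    want = -(-hundredths // 100)
                take = min(want, remaining)
            totals[g] += take
            remaining -= take
    return totals

shares = [6] * 15 + [5, 5]   # control, T1..T14 at 6%; T15, T16 at 5%
strata = [50, 49]            # Exp1.Control, Exp1.Test

floor_sizes = sequential_split(strata, shares, "floor")
ceil_sizes = sequential_split(strata, shares, "ceil")
print(floor_sizes)  # [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 20]
print(ceil_sizes)   # [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 3]
```

Under these assumptions floor piles the leftovers into the last group, which is close to the skewed split reported above, while ceiling starves the last group.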

So is there a way to find the number of splits beyond which stratification is mathematically no longer possible?

Best Answer

One possible answer to your problem is to not stratify so strictly. Use neither floor nor ceiling throughout. Distribute as many observations as you can so that each treatment gets an equal number, then distribute the rest using some sort of random number generator.
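A minimal sketch of that idea, assuming equal target sizes for all 17 groups and treating the two Experiment 1 arms as the strata. The function names, member labels, and seed are hypothetical and only for illustration.

```python
import random

def allocate_stratum(members, n_groups, rng):
    """Randomly split one stratum across n_groups as evenly as possible."""
    shuffled = list(members)
    rng.shuffle(shuffled)
    base, leftover = divmod(len(shuffled), n_groups)
    # The groups that receive one extra member are chosen at random.
    extra = set(rng.sample(range(n_groups), leftover))
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g in extra else 0)
        groups.append(shuffled[start:start + size])
        start += size
    return groups

def stratified_assignment(strata, n_groups, seed=0):
    """Combine the per-stratum splits into the final experiment groups."""
    rng = random.Random(seed)
    combined = [[] for _ in range(n_groups)]
    for members in strata.values():
        for g, chunk in enumerate(allocate_stratum(members, n_groups, rng)):
            combined[g].extend(chunk)
    return combined

strata = {
    "exp1_control": [f"c{i}" for i in range(50)],  # placeholder IDs
    "exp1_test": [f"t{i}" for i in range(49)],
}
groups = stratified_assignment(strata, n_groups=17)
sizes = [len(g) for g in groups]
print(sizes, sum(sizes))
```

Every person is assigned exactly once, and each group draws either 2 or 3 people from each stratum, so group sizes stay between 4 and 6 regardless of how many groups there are.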

But even then, I think you are stratifying too much. It is always a trade-off between enforcing presence of relevant observations in all strata and interfering with the randomness of sampling. At the level of stratification you want to enforce, you are practically handpicking the observations into the strata. That calls into question the random i.i.d. distribution of your treatment samples and thereby any inference you can do based on it.

So is there a way to find at how many number of splits will it mathematically not be possible to do stratification?

Before you run into the problem of not being able to have uniformly stratified groups because rounding up or down makes too much of a difference to the group sizes, you will have run into many other, and I think bigger, problems.

  • As I mentioned, this is not really random sampling anymore
  • Each test will have extremely low power with such a small $n_g$ in each group (if $n_g$ were remotely large enough for statistical power considerations, it would also be large enough that rounding it up or down wouldn't make a relevant difference)
  • Not only is each test underpowered, you will also have multiple comparisons and need to correct for false discovery rates, which further reduces power
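To give a rough sense of scale for the power points, here is a sketch using a normal approximation for a two-sided, two-sample comparison of means. Everything numeric is an assumption: a standardized effect size of 0.5, about 6 people per group, and a Bonferroni correction over 16 treatment-vs-control comparisons. Bonferroni is used as a simple stand-in; it controls the family-wise error rate rather than the false discovery rate.

```python
from statistics import NormalDist

def approx_power(n_per_group, effect_size, alpha):
    """Approximate power of a two-sided, two-sample z-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of rejecting in either tail under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

n_g = 6    # roughly the group size in the question
d = 0.5    # assumed standardized effect size (hypothetical)
m = 16     # number of treatment-vs-control comparisons

single = approx_power(n_g, d, alpha=0.05)
corrected = approx_power(n_g, d, alpha=0.05 / m)
print(f"power, single test:        {single:.2f}")
print(f"power, Bonferroni over {m}: {corrected:.2f}")
```

Under these assumptions both values come out far below the conventional 80% target, and the corrected one is lower still.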

How to deal with your current problem? Don't try to do exploratory analysis and hypothesis testing at the same time: Reduce the number of treatments through exploratory analysis before you test the remaining credible candidates with new data. (Or, if you must do it all at once, have more budget for much larger $n$ and do a rigorous power analysis to determine that $n$.)
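As a sketch of what such a power analysis might look like, the standard normal-approximation formula for two equal groups is $n_g \approx 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$. The effect size $d = 0.5$, the 80% power target, and the Bonferroni-adjusted $\alpha$ are assumptions for illustration, not values from the question.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Required group size as the number of treatments kept for testing grows,
# with alpha split Bonferroni-style across the comparisons (an assumption).
for k in (1, 2, 4, 16):
    print(k, n_per_group(0.5, alpha=0.05 / k))
```

The required $n_g$ grows with the number of treatments being compared, which is one reason to narrow the candidates first.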
