[Math] Choosing stratification variable for stratified sampling

sampling statistics

In general, how do you choose which variable to stratify your sample over?

More specifically, what should the proportions look like on the variable you're stratifying over?

For example, I have data that I can stratify based on either age (15-90 years old), race, sex, or marital status. With this data, I am trying to estimate total income for the population.

When I stratify over race, I get

> stratsizes = table(ipums_data$Race)
> prop.table(stratsizes)

         1           2           3           4           5 
0.863919493 0.109874488 0.006079198 0.016572829 0.003553993 

where the numbers 1 through 5 represent different races.

When I stratify over marital status, I get

> stratsizes = table(ipums_data$Marstat)
> prop.table(stratsizes)

        1          2          3          4          5 
0.58285479 0.02218440 0.06168983 0.07685977 0.25641122 

where again, the numbers 1 through 5 represent the different marital statuses.

Stratifying by age gives a fairly even proportion across all ages 15-90, and stratifying by sex gives a 50-50 split.

What variable would be the best to stratify over, and why?

Best Answer

It depends on which stratifying variable(s) is/are most interesting. (In an extreme case you might get an example of 'Simpson's Paradox'; google it if you don't already know what that is.) Also, it might depend on the purpose of the study.

If I understand your examples, you have shown nothing about income; only proportions of the whole sample at each level of the two stratifications. Income data for 'levels' with very low proportions might not be reliable. (For example, race levels 3 and 5; marital category 2.) You might consider combining some levels to get larger proportions in all levels, provided that combined levels make sense.
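As a minimal sketch of what combining levels could look like in R: the vector `race` below is a toy stand-in rebuilt from the race proportions in the question, with an assumed total of roughly 10,000 people. Merging levels 3, 4 and 5 into a single "other" level is purely illustrative. Whether that particular merge makes sense depends on what the codes mean.

```r
# Race proportions copied from the question's prop.table() output
race_props <- c(0.863919493, 0.109874488, 0.006079198, 0.016572829, 0.003553993)

# Toy stand-in for ipums_data$Race, assuming about 10,000 people in total
race <- factor(rep(1:5, times = round(race_props * 10000)))

# Merge the sparse levels by giving them the same label (illustrative choice)
race_combined <- race
levels(race_combined)[levels(race_combined) %in% c("3", "4", "5")] <- "other"

# Proportions after combining
print(prop.table(table(race_combined)))
```

Assigning the same label to several levels is how base R collapses them, so no extra package is needed.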

How small is too small depends on the total sample size. If the entire database has a million people, then a level with 2% still represents a lot of people--enough for a meaningful median or mean. But if you have 150 people altogether, then combining is clearly indicated.
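To make the sample-size point concrete, here is a small sketch that turns the marital-status proportions from the question into expected head counts for two totals. The helper name `expected_counts` and the totals of one million and 150 are taken from this answer's examples, not from the actual data set.

```r
# Marital-status proportions copied from the question's prop.table() output
marstat_props <- setNames(
  c(0.58285479, 0.02218440, 0.06168983, 0.07685977, 0.25641122),
  1:5
)

# Hypothetical helper: expected number of people per level for a given total
expected_counts <- function(props, n_total) round(props * n_total)

counts_large <- expected_counts(marstat_props, 1e6)  # a million people
counts_small <- expected_counts(marstat_props, 150)  # only 150 people

print(counts_large)
print(counts_small)
```

With a million people, the 2% level still holds tens of thousands of cases. With 150 it holds only a handful, which is too few for a stable mean or median.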

Are you interested in discussing income inequality? Do different race 'levels' show remarkably (and statistically significantly) different incomes? How about marital status levels? Is there an age group with peak income? And so on.
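One rough way to explore these questions is to compare mean income across the levels of each candidate variable and see how much of the income variation each one accounts for. The sketch below uses simulated toy data only. The column name `Income`, the sample size, and the way income is generated are all assumptions for illustration, and none of the numbers come from the question's data. A variable that accounts for more of the variation in income tends to give more homogeneous strata.

```r
set.seed(1)  # toy data only; nothing here comes from the real ipums_data
n <- 2000
toy <- data.frame(
  Sex     = factor(sample(1:2, n, replace = TRUE)),
  Marstat = factor(sample(1:5, n, replace = TRUE,
                          prob = c(0.58, 0.02, 0.06, 0.08, 0.26)))
)

# Hypothetical income: by construction it depends on Marstat but not on Sex
toy$Income <- 30000 + 5000 * as.integer(toy$Marstat) + rnorm(n, sd = 10000)

# Mean income within each level of one candidate variable
level_means <- tapply(toy$Income, toy$Marstat, mean)
print(round(level_means))

# Share of income variance accounted for by each candidate
# (R-squared of a one-way linear fit)
explained <- sapply(c("Sex", "Marstat"), function(v) {
  summary(lm(reformulate(v, response = "Income"), data = toy))$r.squared
})
print(round(explained, 3))
```

On real data, the same comparison would be run with the actual income column in place of the simulated one.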
