Survey Sampling – How to Calculate Confidence Levels of a Stratified Sample with Missing Units

non-response, sampling, survey, survey-sampling, survey-weights

I'm conducting a stratified survey among health institutions; 700 units were allocated across 30 strata designed to reflect the proportions in the population. However, it has been difficult to obtain even 450 questionnaires, even after extending the project period, and I need to end it as soon as possible.
A professor asked me whether the survey will remain statistically valid if we end it with 450 interviews. The survey was planned to have a 3% margin of error at a 90% level of confidence, with $\hat{p}=0.5$ for the unknown proportion. The overall response rate is 72%, with most strata at about 80%, but 6 strata have a response rate just above 40%.
At this stage, we wonder what the confidence level would be for each stratum. Is it possible to compute it? Is there a procedure for assessing the confidence level of a stratified survey that would help us decide how far we can go with this? Also, is it right to reweight/calibrate the actual sample so that it reflects the data, or should we keep the full sample (700) with all the missing units (250)?

Best Answer

It looks like you've done all you could. The strata with the high non-response rates apparently do not constitute a large portion of the population. In retrospect, I would have suggested a smaller sample, with pilot tests, and more time devoted to follow-up of a random sample of non-responders.

Here are my thoughts on what you should do:

Reweighted analysis

You should reweight to correct for non-response. Define

$$N_h = \text{number of institutions in stratum }h $$

and

$$ m_h = \text{number of responding institutions in stratum }h $$

Then you can remove non-response bias related to stratum membership by running a survey program with weight defined for each institution in stratum $h$ as

$$ w_h = \dfrac{N_h}{m_h}$$

This will not remove non-response biases related to other, within-stratum, factors, but it's better than doing nothing.
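As a minimal sketch of the weight calculation above, here is how $w_h = N_h / m_h$ could be computed in Python. The stratum labels and counts are made up for illustration, and `nonresponse_weights` is a hypothetical helper, not part of any survey package.

```python
# Hypothetical counts for three strata (not taken from the survey in the question).
population_counts = {"A": 120, "B": 80, "C": 40}   # N_h: institutions in the population
respondent_counts = {"A": 24, "B": 20, "C": 4}     # m_h: institutions that responded


def nonresponse_weights(population_counts, respondent_counts):
    """Return w_h = N_h / m_h for every stratum."""
    weights = {}
    for stratum, n_pop in population_counts.items():
        m = respondent_counts.get(stratum, 0)
        if m == 0:
            # The weight is undefined when nobody in the stratum responded.
            raise ValueError(f"stratum {stratum!r} has no respondents")
        weights[stratum] = n_pop / m
    return weights


weights = nonresponse_weights(population_counts, respondent_counts)
print(weights)

# Sanity check: the weighted respondents add up to the population size.
total = sum(weights[h] * respondent_counts[h] for h in weights)
print(total)
```

Each responding institution in stratum $h$ then carries the weight `weights[h]` in the survey program.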

Do a stratified analysis with a survey program

The original sample size calculation apparently was the one appropriate for a simple random sample, not for a stratified random sample; i.e. it assumed that the estimated proportion would be the overall sample proportion.
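To see what the simple-random-sample formula implies for the two sample sizes in the question, here is a rough sketch. It uses the normal approximation with $\hat{p}=0.5$ and ignores the finite population correction, because the population size is not given; `srs_margin_of_error` is a name chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def srs_margin_of_error(n, confidence=0.90, p=0.5):
    """Half-width of the normal-approximation interval for a proportion under SRS."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided critical value
    return z * sqrt(p * (1 - p) / n)


for n in (700, 450):
    print(n, round(srs_margin_of_error(n), 4))
```

Under these assumptions the half-width is about 3.1 percentage points at 700 responses and about 3.9 at 450. These are only the simple-random-sample figures; they do not reflect the stratification or the reweighting.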

Instead, you should use a survey analysis program that accepts stratum and weight information. Stata and SAS contain such programs. They will compute a stratified proportion, with an estimated standard error that will usually be smaller than that of the ordinary sample proportion. You won't know exactly what the bound on error will be until you do the calculation.
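For intuition about what such a program computes, here is a minimal sketch of the textbook stratified estimator of a proportion, $\hat{p}_{st} = \sum_h (N_h/N)\,\hat{p}_h$, with its estimated standard error. It assumes the respondents in each stratum can be treated as a simple random sample of that stratum, which is the same assumption the reweighting makes. The counts are hypothetical and `stratified_proportion` is an illustrative helper; a real analysis should use a survey package.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical strata: (N_h in population, m_h respondents, respondents answering "yes").
strata = {
    "A": (120, 24, 12),
    "B": (80, 20, 5),
    "C": (40, 4, 3),
}


def stratified_proportion(strata, confidence=0.90):
    """Estimate a population proportion, its SE, and an interval (needs m_h >= 2)."""
    n_total = sum(n_pop for n_pop, _, _ in strata.values())
    estimate = 0.0
    variance = 0.0
    for n_pop, m, yes in strata.values():
        share = n_pop / n_total      # W_h = N_h / N
        p_h = yes / m                # proportion among respondents in the stratum
        fpc = 1 - m / n_pop          # finite population correction
        estimate += share * p_h
        variance += share ** 2 * fpc * p_h * (1 - p_h) / (m - 1)
    se = sqrt(variance)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate, se, (estimate - z * se, estimate + z * se)


estimate, se, interval = stratified_proportion(strata)
print(round(estimate, 4), round(se, 4), interval)
```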

You can estimate a confidence interval for every stratum, but be aware that the relevant sample size is the number of responding observations ($m_h$) in the stratum. The average of these is 450/30 = 15, so some intervals will be very wide.
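To give a sense of how wide "very wide" is, the sketch below computes a 90% interval for a stratum with 15 respondents and an observed proportion of 0.5. I use the Wilson score interval here because the plain normal approximation is unreliable at such small sample sizes; that is my choice for the illustration, not something a particular survey package is guaranteed to report, and it ignores the finite population correction.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p_hat, n, confidence=0.90):
    """Wilson score interval for a proportion observed among n respondents."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half


lo, hi = wilson_interval(0.5, 15)
print(round(lo, 3), round(hi, 3))
```

With these inputs the interval runs from roughly 0.30 to 0.70, so a single stratum of average size says little on its own.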

You can, of course, consider subsets of the population, including groups of strata or subsets defined by characteristics measured during the survey. Such subpopulation standard errors require a special formula, but every package with survey capabilities (e.g. Stata, SAS, the survey package in R) will use it.

Added: To answer your question about which sample to keep. The analysis will be based on the 450 responding institutions in the sample, but you will need to add information about the numbers of institutions in the population. You should keep the 250 non-responding institutions in the data set. They won't affect the analysis because the values of all measured variables will be missing (you can also set the weight variable to zero or missing), but you need them to make a table describing response rates by stratum.
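A response-rate table of that kind can be built directly from the full sample file. The sketch below assumes each sampled unit is recorded as a hypothetical `(stratum, responded)` pair; the ten records are invented for illustration.

```python
from collections import Counter

# Hypothetical sampled units: (stratum, responded). Non-respondents stay in the file.
sample = [
    ("A", True), ("A", True), ("A", False), ("A", True), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", True), ("B", False),
]

sampled = Counter(stratum for stratum, _ in sample)
responded = Counter(stratum for stratum, ok in sample if ok)

# Response rate per stratum = respondents / sampled units.
response_rates = {h: responded[h] / sampled[h] for h in sampled}
for h in sorted(response_rates):
    print(f"{h}: sampled={sampled[h]} responded={responded[h]} rate={response_rates[h]:.0%}")
```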

Related Solutions

Solved – Stratified random sampling when strata overlap

If you really have to divide the population into 5 strata, you need to make those strata mutually exclusive. You can achieve that by assigning each visitor to one department only, for example the one they visited the most. Let's look at the following fake data set, where we are asked to separate a population of 4 visitors who visited 2 departments into 2 strata, but the departments overlap in terms of their visitors.

visitor | department | visitedTime
----------------------------------
   1    |     a      |      5 
   1    |     b      |      4 
   2    |     a      |      0 
   2    |     b      |      2 
   3    |     a      |      9 
   3    |     b      |      1 
   4    |     a      |      0 
   4    |     b      |      5

We see some visitors visited only one department (2 and 4), and the other two visited both departments, causing the stratification to fail. We can collapse this data set to the most visited department per visitor, and get

visitor | mostVisited
---------------------
   1    |     a      
   2    |     b      
   3    |     a      
   4    |     b

Departments are non-overlapping in the second table. From an interpretation standpoint, I think it makes sense to ask visitors about the department they visit the most, so this strategy is easy to justify.
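The collapsing step can be sketched in a few lines of Python. The data are the eight rows of the first table; `most_visited` is a helper written for this illustration, and its tie rule (keep the first department seen) is an assumption that a real study would need to replace with an explicit rule.

```python
# The visit table from the example: (visitor, department, visitedTime).
visits = [
    (1, "a", 5), (1, "b", 4),
    (2, "a", 0), (2, "b", 2),
    (3, "a", 9), (3, "b", 1),
    (4, "a", 0), (4, "b", 5),
]


def most_visited(visits):
    """Map each visitor to the department with the largest visitedTime."""
    best = {}
    for visitor, department, count in visits:
        # Ties keep the first department seen (an assumption of this sketch).
        if visitor not in best or count > best[visitor][1]:
            best[visitor] = (department, count)
    return {visitor: department for visitor, (department, _) in best.items()}


strata = most_visited(visits)
print(strata)
```

The result assigns visitors 1 and 3 to department a and visitors 2 and 4 to department b, matching the second table.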

Now, you may have a highly unbalanced picture in which one department is always (or never) the most visited. Think about the reception: everybody has to pass through it on every visit, which makes it the most visited department. If you can omit such a department, you should, to make your life easier. If you can't, you can keep this department outside of the stratification, i.e. sample a certain number of visitors from it first, and then apply the "collapsed stratification" I just discussed to the remaining departments.
