I am primarily using Stata for my survey analysis, but my question is also applicable to R or SUDAAN.
There are a number of different ways of accounting for singleton strata in survey analysis. Just to be clear, when I say "singleton strata", I mean that there is only a single sampling unit within one or more strata in my analysis. The problem this poses is primarily for variance estimation, since a within-stratum variance cannot be computed from one unit. Some sources recommend either leaving those strata out of the analysis entirely or merging that unit into another stratum within the analysis. These methods can be justified in a number of scenarios, and have been discussed in detail in other online sources, but suffice it to say that I do not find them appropriate for my data set. Another method is to generate replication weights (bootstrap, jackknife, etc.). This is possible for my data set, but would require extra work that I'm not convinced is necessary at this point.
Rather, I want to focus on the 3 methods you can pass to the singleunit("") option of svyset in Stata, or to options(survey.lonely.psu="") in R (where "scaled" and "centered" are called "average" and "adjust", respectively). These methods are:
1) Certainty: a stratum with a single unit does not contribute to the standard error estimates at that level of sampling. I believe this is equivalent to treating the unit as sampled with certainty, i.e. a sampling fraction of 1 in the finite population correction.
2) Scaled: the contribution to the standard error for each singleton stratum is set equal to the average contribution of the other (multiple-unit) strata. This is essentially mean imputation.
3) Centered: the standard error contribution for the singleton stratum is calculated from the unit's distance to the grand mean across strata rather than to a stratum mean.
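To make the three definitions concrete, here is a minimal Python sketch of how each option would change a simple with-replacement variance estimate of a total. The function names (stratum_contribution, variance_of_total) and the toy data are hypothetical, and the formulas are a simplified reading of the descriptions above; this is not meant to reproduce the exact numbers Stata or the R survey package would report.

```python
from statistics import mean


def stratum_contribution(z):
    """With-replacement variance contribution of one stratum with n >= 2 PSUs.

    z holds the weighted PSU totals; the contribution is
    n / (n - 1) * sum((z_i - stratum mean)^2).
    """
    n = len(z)
    zbar = mean(z)
    return n / (n - 1) * sum((v - zbar) ** 2 for v in z)


def variance_of_total(strata, singleunit):
    """Variance of an estimated total under one singleton-stratum rule.

    strata is a list of lists of weighted PSU totals (hypothetical layout).
    singleunit is "certainty", "scaled", or "centered".
    """
    multi = [z for z in strata if len(z) > 1]
    single = [z for z in strata if len(z) == 1]
    base = sum(stratum_contribution(z) for z in multi)

    if singleunit == "certainty":
        # Singleton strata contribute nothing.
        extra = 0.0
    elif singleunit == "scaled":
        # Each singleton stratum gets the average multi-unit contribution.
        extra = len(single) * base / len(multi)
    elif singleunit == "centered":
        # Assumption: squared distance from the grand mean of all PSU
        # totals, with a scale factor of 1 for the singleton stratum.
        grand = mean([v for z in strata for v in z])
        extra = sum((z[0] - grand) ** 2 for z in single)
    else:
        raise ValueError(f"unknown singleunit rule: {singleunit}")
    return base + extra


if __name__ == "__main__":
    # Two ordinary strata and one singleton stratum (made-up numbers).
    toy = [[10.0, 14.0], [20.0, 26.0], [40.0]]
    for rule in ("certainty", "scaled", "centered"):
        print(rule, variance_of_total(toy, rule))
```

On this made-up data the three rules give variances of 52, 78 and 376. The large "centered" value comes from the singleton unit sitting far from the grand mean in the toy data, so it illustrates the mechanism rather than a general ordering of the methods.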
In practice, when I try all 3 methods on my data, the differences are minimal. The largest difference in standard errors was 0.001, which is not something I am particularly worried about. However, I would still like to understand the theoretical differences between the methods in more detail before I decide which one to implement in the final analysis, even if practically there is no difference.
So, basically, what are the theoretical distinctions between these techniques? When might you choose "certainty" versus "scaled" versus "centered"? The one I see used most often when I search around the Internet is "certainty", but none of the sources have actually provided a justification for that. I would guess that "centered" is the most conservative of the 3 methods since it likely overestimates the standard error. Beyond that, though, I am hard-pressed to think of any rigorous way of deciding between the methods.
Anyone have any suggestions? Or even know a citation where someone evaluates these methods in a simulation study?
Best Answer
Also looking for an answer to this. The closest I've found so far is from the R survey package manual:
Other sources (e.g. here) seem to suggest that selecting "certainty" is only appropriate when there is only a single "large" PSU in this stratum in the population, but also note that using approximate methods can give values that are too high.