Solved – Dealing with singleton strata in survey analysis


I am primarily using Stata for my survey analysis, but my question is also applicable to R or SUDAAN.

There are a number of different ways of accounting for singleton strata in survey analysis. Just to be clear, by "singleton strata" I mean that one or more strata in my analysis contain only a single sampling unit. The problem this poses is primarily one of variance estimation. Some sources recommend either leaving those strata out of the analysis entirely or merging the lone unit into another stratum. These methods can be justified in a number of scenarios and have been discussed in detail elsewhere online, but suffice it to say that I do not find them appropriate for my data set. Another option is to generate replication weights (bootstrap, jackknife, etc.). This is possible for my data set, but would require extra work that I'm not convinced is necessary at this point.

Rather, I want to focus on the three methods you can feed into the singleunit() option of svyset in Stata, or the options(survey.lonely.psu="") setting in R's survey package. These methods are:

1) Certainty: this makes it such that a stratum with a single unit does not contribute to the standard error estimates at that level of sampling. I believe this is equivalent to specifying a finite population correction of 1.

2) Scaled: this makes it such that the standard error contribution from the singleton stratum is set equal to the average contribution across the other (multi-unit) strata. This is essentially mean imputation.

3) Centered: this calculates the stratum's contribution to the standard error from its distance to the grand mean across all strata, rather than from the stratum mean.
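To fix ideas, here is my rough (and possibly imperfect) understanding of how a singleton stratum enters the usual between-PSU linearization variance under each option; the notation is my own sketch rather than anything lifted from the Stata or survey-package documentation. For a stratum $h$ with $n_h \ge 2$ sampled PSUs, the contribution is approximately

$$\hat V_h = \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} (z_{hi} - \bar z_h)^2,$$

where $z_{hi}$ is the weighted PSU total of the linearized scores and $\bar z_h$ is their stratum mean. This is undefined when $n_h = 1$, and as far as I can tell the three options substitute

$$\hat V_h^{\text{certainty}} = 0, \qquad \hat V_h^{\text{scaled}} = \frac{1}{G}\sum_{g:\,n_g \ge 2} \hat V_g, \qquad \hat V_h^{\text{centered}} = (z_{h1} - \bar z)^2,$$

where $G$ is the number of strata with at least two sampled PSUs and $\bar z$ is the grand mean of the $z_{hi}$ across all sampled PSUs.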

In practice, when I try all three methods on my data, the differences are minimal. The largest difference in standard errors was 0.001, which is not something I am particularly worried about. However, I would still like to understand the theoretical differences between the methods in more detail before deciding which one to use in the final analysis, even if there is practically no difference.
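For anyone who wants to reproduce that kind of side-by-side comparison, here is a minimal R sketch on made-up data (not my actual survey). As far as I can tell, the R option values "certainty", "average", and "adjust" correspond to Stata's singleunit(certainty), singleunit(scaled), and singleunit(centered), respectively.

    ## Toy stratified sample (made-up numbers): stratum "C" has a single PSU
    library(survey)

    dat <- data.frame(
      stratum = c("A", "A", "B", "B", "C"),
      psu     = 1:5,
      weight  = c(10, 10, 12, 12, 15),
      y       = c(2.3, 3.1, 4.0, 3.6, 5.2)
    )

    ## Loop over the three lonely-PSU settings and compare the SE of the mean
    for (opt in c("certainty", "average", "adjust")) {
      options(survey.lonely.psu = opt)
      des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~weight, data = dat)
      cat(opt, ": SE =", SE(svymean(~y, des)), "\n")
    }

On a tiny toy example like this the three standard errors can differ noticeably; on a real survey with many well-populated strata the singleton's contribution is diluted, which is presumably why the differences are so small in my data.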

So, basically, what are the theoretical distinctions between these techniques? When might you choose "certainty" versus "scaled" versus "centered"? The one I see used most often when I search around the Internet is "certainty", but none of the sources actually provide a justification for it. I would guess that "centered" is the most conservative of the three methods, since it likely overestimates the standard error. Beyond that, though, I am hard-pressed to think of any rigorous way of deciding between the methods.

Does anyone have any suggestions? Or even know of a citation where someone evaluates these methods in a simulation study?

Best Answer

I'm also looking for an answer to this. The closest I've found so far is from the R survey package manual:

Using "adjust" [centered in Stata] is conservative, and it would often be better to combine strata in some intelligent way. The properties of "average" [scaled] have not been investigated thoroughly, but it may be useful when the lonely PSUs are due to a few strata having PSUs missing completely at random.

The "remove" and "certainty" options [certainty] give the same result, but "certainty" is intended for situations where there is only one PSU in the population stratum, which is sampled with certainty (also called ‘self-representing’ PSUs or strata). With "certainty" no warning is generated for strata with only one PSU. Ordinarily, svydesign will detect certainty PSUs, making this option unnecessary.

Other sources (e.g. here) seem to suggest that selecting "certainty" is only appropriate when there is a single "large" PSU in that stratum in the population, but they also note that using approximate methods can give values that are too high.
