Solved – Cluster-robust SE in Stata when using a survey design

clustered-standard-errorsstandard errorstatasurvey

I'm working with data from a clustered sample where observations have a certain sampling weight (pweight). There are two ways to obtain the correct point estimates: I) using reg yvar xvar [pw = pweight] or ii) using svyset[pw = pweight] and then svy : reg yvar xvar These return identical point estimates (as they should). However, once one wants to introduce cluster-robust standard errors, the "manual" approach and the svyset approach return slightly different results. What I mean by "manual" is a command of the form: reg yvar xvar [pw = pweight], cluster(clustervar) as opposed to: svyset clustervar [pw = pweight] and then svy : reg yvar xvar. Here is a little code example to illustrate this with some numbers:

 sysuse auto
 set seed 92122
 *a variable containing random integers from 1 thru 4 designating fake clusters 
 gen mycluster = ceil(4*uniform())
 *random probability weights as the inverse of some random sampling probability
 gen mypw = 1/uniform()

 *run the "manual" regression
 reg price mpg weight [pw = mypw], cluster(mycluster)

 *using svy design
 svyset mycluster [pw = mypw]
 svy : reg price mpg weight

The standard errors are very close to one another but not identical (mpg is 72.48 and 71.48 and weight has 0.969 and 0.956). Stata calls the ones from the svyset-regression "Linearized" so I suppose that's where the difference comes from – potentially a Taylor expansion? Could somebody point me towards the precise (mathematical) difference? Are the patterns, i.e. one is always larger than the other?

I previously posted this question on Stackoverflow but it was deemed more appropriate here.

Best Answer

A standard error for regress with a cluster() option will always be larger than that from svy: regress, with the ratio of squared standard errors equal to $\dfrac{n-1}{n-k}$, where $n$ is the sample size and $k$ is the number of predictors, including the intercept.

Here's a modification of your example to demonstrate this. The change in definition of cluster should make no difference.

sysuse auto, clear
set seed 92122
gen mkr = substr(make,1,2)
codebook mkr // 23 clusters
*run the "manual" regression
reg price mpg weight [pw = gear], cluster(mkr)
scalar v1 = _se[mpg]^2
scalar k = e(rank) // number of predictors + 1
scalar n = e(N)

*using svy design
svyset mkr [pw = gear]
svy : reg price mpg weight
scalar v2 = _se[mpg]^2
di " v1 = " v1   "    v2 = " v2
di " n =  "n   "    k = " k
di " v1/v2 = "v1/v2
di " (n-1)/(n-k) = "(n-1)/(n-k)

The results of the display statements:

v1 = 5364.5846 v2 = 5217.6097
n = 74 k = 3
v1/v2 = 1.028169
(n-1)/(n-k) = 1.028169

The values in the last two lines are identical. Page 469 of the Stata 14 Manual entry for _robust, (http://www.stata.com/manuals14/p_robust.pdf) refers to a multiplier $\dfrac{n}{n-k}$ that should make the two calculations equivalent. I verified the ratio $\dfrac{n-1}{n-k}$ in other examples. I can't account for the difference.

Related Question