Solved – Standard errors in weighted least squares on aggregated data

aggregation, regression, standard error, weighted-regression

I am interested – mostly just for my own knowledge, and not for any real problem – in the use of weighted least squares to estimate a model on individual-level data and aggregated versions of those same data. Here's some simple R code to show what I'm talking about. Basically, I estimate a regression model lm1 on some data (old), then I aggregate the data (to create new) so that I only have one observation for each combination of $X1$, $X2$, and $Y$, and weight based on the number of observations at each combination of $X1$, $X2$, and $Y$.

The resulting coefficients in a regression (lm2) on new are the same as those in lm1, but the standard errors are larger. They are larger by a constant factor, however (i.e., the SEs from lm1 are proportional to those from lm2). How do I figure out what that ratio is? I've searched around for calculations of SEs for weighted least squares, but that has proved unhelpful.

set.seed(1)
n <- 1000
old <- data.frame(x1=sample(1:3,n,TRUE), x2=sample(1:3,n,TRUE))
old$y <- old$x1 + old$x2 + sample(-3:3,n,TRUE)
old$i <- with(old, interaction(x1,x2,y, drop=TRUE))
levels(old$i) <- seq_along(levels(old$i))
s <- split(old, old$i)
new <- do.call(rbind, lapply(s, `[`, 1, 1:4))
new$w <- sapply(s, nrow)

lm1 <- lm(y ~ x1 + x2, data=old)
lm2 <- lm(y ~ x1 + x2, data=new, weights=w)

# coefficients equal?
all.equal(coef(lm1),coef(lm2))
## [1] TRUE

# ratio of SEs
summary(lm1)$coef[,2]/summary(lm2)$coef[,2]
## (Intercept)          x1          x2 
##   0.2453172   0.2453172   0.2453172

Why are the SEs from the aggregated model proportionate to those from the individual-level model? And how do I know what their ratio is?

Best Answer

The value of the ratio can be obtained as follows, noting that the discrete sampling in lines 3 and 4 of the code implies 3 options for $X1$, 3 for $X2$, and 7 for $Y$ conditional on each combination of $X1$ and $X2$:

$(N - K)_{old} = 1000 - 3 = 997$

$(N - K)_{new} = (3 \times 3 \times 7) - 3 = 63 - 3 = 60$

Ratio of standard errors = $\sqrt{\frac{(N - K)_{new}}{(N - K)_{old}}} = \sqrt{\frac{60}{997}} = 0.2453172$
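As a quick check, the ratio can be recomputed from the residual degrees of freedom of the two fits. The sketch below rebuilds the data as in the question but aggregates with `aggregate()` instead of `split()`; that choice, the helper `se()`, and the names `ratio_se` and `ratio_df` are assumptions for illustration. Since `sample()` changed its default algorithm in R 3.6.0, the simulated draws may differ from those in the question, but the identity should hold whenever identical rows are grouped.

```r
set.seed(1)
n <- 1000
old <- data.frame(x1 = sample(1:3, n, TRUE), x2 = sample(1:3, n, TRUE))
old$y <- old$x1 + old$x2 + sample(-3:3, n, TRUE)

# One row per observed (x1, x2, y) combination; w counts the rows in each group
new <- aggregate(w ~ x1 + x2 + y, data = transform(old, w = 1), FUN = sum)

lm1 <- lm(y ~ x1 + x2, data = old)
lm2 <- lm(y ~ x1 + x2, data = new, weights = w)

# Hypothetical helper: extract coefficient standard errors from a fitted lm
se <- function(fit) coef(summary(fit))[, "Std. Error"]

ratio_se <- se(lm1) / se(lm2)                           # ratio of reported SEs
ratio_df <- sqrt(df.residual(lm2) / df.residual(lm1))   # sqrt((N - K)_new / (N - K)_old)

print(ratio_se)
print(ratio_df)
```

Both printed quantities should agree for every coefficient, because `df.residual()` returns exactly the $(N - K)$ used by `summary()`.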

Why is $(N - K)$ relevant here? Because it is the denominator in the formula for the OLS estimator $s^2$ of the variance parameter $\sigma_0^2$:

$s^2 = \frac{SSR}{N - K}$

where SSR is the sum of squared residuals, $N$ is the number of observations and $K$ is the number of estimated coefficients (here 3). This in turn feeds into the estimator of the variance of the coefficient vector $\hat\beta$, with $X$ as the matrix of independent variables:

$V[\hat\beta | X] = \sigma_0^2 (X'X)^{-1}$, estimated by $s^2 (X'X)^{-1}$

and so into the standard errors of the coefficients. These formulae also apply to the WLS coefficients, provided that SSR and $X$ are based on the weighted variables (WLS being equivalent to OLS on variables scaled by the square roots of the weights).
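The equivalence between WLS and OLS on weighted variables can be illustrated with a small sketch. The simulated data frame `d` and its columns are hypothetical; the only assumption is that each variable, including the constant, is multiplied by the square root of its weight.

```r
set.seed(42)
d <- data.frame(x = rnorm(30), w = runif(30, 0.5, 3))  # hypothetical data and weights
d$y <- 1 + 2 * d$x + rnorm(30)
d$sw <- sqrt(d$w)

# WLS as reported by lm()
wls <- lm(y ~ x, data = d, weights = w)

# OLS on transformed variables: y, the constant and x are all multiplied by sqrt(w)
ols <- lm(I(sw * y) ~ 0 + sw + I(sw * x), data = d)

# Coefficients and standard errors side by side
cbind(wls = coef(wls), ols = coef(ols))
cbind(wls = coef(summary(wls))[, "Std. Error"],
      ols = coef(summary(ols))[, "Std. Error"])
```

The two columns should match in each table, since the SSR of the transformed regression is the weighted SSR and its cross-product matrix is $X'WX$.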

However, it is important to note (especially for anyone who may be using regression with aggregate data in real applications such as in economics) that the simple ratio formula above only works because of particular features of this case. In general, the effect of aggregation on standard errors is more complex, with effects via the SSR and $X'X$ (both of which happen to be unchanged by aggregation in this case) needing to be considered as well as those via $(N - K)$.
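One way to see why only the degrees of freedom matter in this case is sketched below. The notation is an assumption introduced for illustration: $g$ indexes the aggregated rows, $w_g$ is the number of identical individual rows $(x_g, y_g)$ in group $g$, $W$ is the diagonal matrix of the $w_g$, and $i$ indexes the individual rows. Because the coefficient estimates coincide, each group's residual equals the residual of every individual row in that group.

$X_{new}' W X_{new} = \sum_g w_g x_g x_g' = \sum_i x_i x_i' = X_{old}' X_{old}$

$SSR_{new} = \sum_g w_g (y_g - x_g'\hat\beta)^2 = \sum_i (y_i - x_i'\hat\beta)^2 = SSR_{old}$

so that, for every coefficient,

$\frac{SE_{old}}{SE_{new}} = \sqrt{\frac{SSR / (N - K)_{old}}{SSR / (N - K)_{new}}} = \sqrt{\frac{(N - K)_{new}}{(N - K)_{old}}}$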

A relevant feature of this case is that aggregation does not group together different $Y$ values, each combination of $X1$, $X2$ and $Y$ forming a separate aggregation. Thus there is no averaging of $Y$ values which would tend to reduce the residuals. Suppose, by contrast, that a sample has two observations of $y$ for each observed $x$ value and that in each case the two observations happen to lie on opposite sides of the fitted line, but the same distance from it. Then regression on the unaggregated data will produce non-zero standard errors of the coefficients, but aggregation at each $x$ value (that is, averaging of its two $y$ observations) will produce a perfect fit with zero residuals and therefore zero standard errors. In that case, therefore, the zero SSR in the aggregate model will dominate any effects via $(N - K)$ and $X'X$.
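The contrasting case can be made concrete with a toy data set. The numbers below are invented for illustration: each $x$ value has two observations lying one unit above and one unit below the line $y = 2 + 3x$, and the aggregated data average them with weight 2.

```r
# Two observations per x value, symmetric around the line y = 2 + 3x (made-up data)
x <- rep(1:5, each = 2)
y <- 2 + 3 * x + rep(c(-1, 1), times = 5)
full <- data.frame(x = x, y = y)

# Aggregate by averaging the two y values at each x; each group has weight 2
agg <- aggregate(y ~ x, data = full, FUN = mean)
agg$w <- 2

fit_full <- lm(y ~ x, data = full)
fit_agg  <- lm(y ~ x, data = agg, weights = w)

# (Weighted) sum of squared residuals in each fit
ssr_full <- deviance(fit_full)
ssr_agg  <- deviance(fit_agg)

se_full <- sqrt(diag(vcov(fit_full)))
# lm warns about an essentially perfect fit here, which is the point of the example
se_agg <- suppressWarnings(sqrt(diag(vcov(fit_agg))))

print(rbind(full = se_full, aggregated = se_agg))
```

The coefficients agree across the two fits, but the aggregated standard errors collapse to numerical zero because averaging removed all residual variation.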

Related Solutions

Solved – Using Weighted Least Squares with Robust Standard Errors

I break your concerns about the estimator into two areas: efficiency and asymptotic validity. I'll define a procedure as asymptotically valid if the point estimates are consistent and the estimated variance-covariance matrix is consistent. An extension of Alecos's arguments shows that robust (i.e., sandwich) standard errors result in asymptotic validity regardless of the assumed weighting matrix, and in fact this result holds even for clustered/correlated data (as long as independence holds at the uppermost level of clustering).

I'll define the efficiency of the estimate in terms of the true asymptotic variance-covariance matrix of the coefficients. Of course, from Gauss-Markov we know that only when you select weights proportional to the inverse conditional variance of each observation will you achieve the best linear unbiased estimator.$^1$ So based on first-order asymptotic concerns, we may just take the best stab we can at estimating the weights, then go ahead and use robust standard errors to guard against mistakes in the weights.

To say anything more refined than this we need to think of second-order asymptotic or finite-sample concerns. An example of a second-order concern might be the "variance of the variance." While I don't have the inclination to try to make Alecos's argument rigorous, I believe it does hold: when you estimate additional, unnecessary parameters you introduce additional variance in the remaining parameters. (You might be able to make it rigorous by considering Schur decompositions of blocks of the information matrix?) So there is probably a second-order bias-variance tradeoff present: when you use robust standard errors, you eliminate bias in the standard errors, at the cost of maybe more variance in them.

Most people seem to care more about the bias than the variance, but if this tradeoff is important, then the only advice I have to offer is to simulate or bootstrap to see how much it might matter in your application. There's probably some additional theory extant or to be developed that could offer some advice by using higher-order asymptotics, but that's beyond my paygrade.
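A simulation along these lines might look like the following sketch. Everything in it is an assumption made for illustration: the design, the true variance function ($x^4$), the deliberately wrong working weights ($1/x$), and the use of a hand-coded HC0 sandwich rather than any particular package.

```r
set.seed(2024)
n <- 200
reps <- 1000
x <- seq(0.1, 2, length.out = n)
X <- cbind(1, x)
w <- 1 / x                          # working weights: wrongly assume variance proportional to x
bread <- solve(t(X) %*% (w * X))    # (X'WX)^{-1}

slope <- se_model <- se_robust <- numeric(reps)
for (r in seq_len(reps)) {
  y <- 1 + 2 * x + rnorm(n, sd = x^2)      # true error variance is x^4
  b <- bread %*% t(X) %*% (w * y)          # WLS coefficients
  e <- drop(y - X %*% b)                   # residuals as a plain vector
  s2 <- sum(w * e^2) / (n - 2)             # model-based variance estimate
  meat <- t(X) %*% (w^2 * e^2 * X)         # X'W diag(e^2) W X (HC0 meat)
  slope[r] <- b[2]
  se_model[r] <- sqrt(s2 * bread[2, 2])
  se_robust[r] <- sqrt((bread %*% meat %*% bread)[2, 2])
}

# Empirical SD of the slope versus the average of each SE estimator
result <- c(empirical = sd(slope),
            model = mean(se_model),
            robust = mean(se_robust))
print(result)
# Spread of each SE estimator across replications (the "variance of the variance")
print(c(model = sd(se_model), robust = sd(se_robust)))
```

Comparing the averages against the empirical standard deviation speaks to bias, while the spread of each estimator across replications speaks to its variance.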

$^1$ Proof here, apparently originally due to Aitken.

Solved – Standard Errors with Weighted Least Squares Regression

Essentially you already computed everything you need. The missing piece is just that the sig_i should be the residual standard error divided by the corresponding square root of the weight. In OLS this isn't necessary because all weights are 1.

sig_i <- resid_var2 / sqrt(wts)
bread <- solve(t(X) %*% W %*% X)
meat <- t(X) %*% W %*% diag(sig_i^2) %*% t(W) %*% X
var_betas2 <- bread %*% meat %*% bread

And then you get:

sqrt(diag(var_betas2))
##                   z 
## 0.1760843 0.2508150

which matches the output of summary() and vcov():

sqrt(diag(vcov(lm_wls)))
## (Intercept)           z 
##   0.1760843   0.2508150

Even more familiar might be the form $\hat \sigma^2 (X^\top W X)^{-1}$, where the terms from the full sandwich (= bread * meat * bread) have already been simplified to just the bread:

var_betas2a <- resid_var2^2 * solve(t(X) %*% W %*% X)
sqrt(diag(var_betas2a))
##                   z 
## 0.1760843 0.2508150
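Since the snippets above rely on objects defined in the original question, here is a self-contained sketch of the same calculation on simulated data. The data-generating values are invented, and `sigma_hat` is a name introduced here for the residual standard error; `X`, `W`, `wts`, `z` and `lm_wls` mirror the names used above.

```r
set.seed(123)
n <- 50
z <- runif(n)
wts <- runif(n, 0.5, 2)                         # hypothetical known weights
y <- 1 + 2 * z + rnorm(n, sd = 1 / sqrt(wts))   # variance inversely proportional to weight

lm_wls <- lm(y ~ z, weights = wts)
X <- model.matrix(lm_wls)
W <- diag(wts)

sigma_hat <- summary(lm_wls)$sigma   # residual standard error of the weighted fit
sig_i <- sigma_hat / sqrt(wts)       # per-observation standard deviation

bread <- solve(t(X) %*% W %*% X)
meat <- t(X) %*% W %*% diag(sig_i^2) %*% W %*% X

se_sandwich <- sqrt(diag(bread %*% meat %*% bread))   # full sandwich
se_simple <- sigma_hat * sqrt(diag(bread))            # simplified form
se_lm <- sqrt(diag(vcov(lm_wls)))                     # what summary() reports

print(rbind(se_sandwich, se_simple, se_lm))
```

All three rows should coincide, because $W \, \mathrm{diag}(\hat\sigma^2 / w_i) \, W = \hat\sigma^2 W$, so the meat collapses to $\hat\sigma^2 X^\top W X$.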
