Standard errors in weighted least squares on aggregated data

aggregation, regression, standard error, weighted-regression

I am interested – mostly just for my own knowledge, and not for any real problem – in the use of weighted least squares to estimate a model on individual-level data and aggregated versions of those same data. Here's some simple R code to show what I'm talking about. Basically, I estimate a regression model lm1 on some data (old), then I aggregate the data (to create new) so that I only have one observation for each combination of $X1$, $X2$, and $Y$, and weight based on the number of observations at each combination of $X1$, $X2$, and $Y$.

The resulting coefficients in a regression (lm2) on new are the same as those in lm1, but the standard errors are larger. They are larger by a constant factor, however (i.e., the SEs from lm2 are proportional to those from lm1). How do I figure out what that ratio is? I've searched around for calculations of SEs for weighted least squares, but that has proved unhelpful.

set.seed(1)
n <- 1000
old <- data.frame(x1=sample(1:3,n,TRUE), x2=sample(1:3,n,TRUE))
old$y <- old$x1 + old$x2 + sample(-3:3,n,TRUE)
old$i <- with(old, interaction(x1,x2,y, drop=TRUE))
levels(old$i) <- seq_along(levels(old$i))
s <- split(old, old$i)
new <- do.call(rbind, lapply(s, `[`, 1, 1:4))
new$w <- sapply(s, nrow)

lm1 <- lm(y ~ x1 + x2, data=old)
lm2 <- lm(y ~ x1 + x2, data=new, weights=w)

# coefficients equal?
all.equal(coef(lm1),coef(lm2))
## [1] TRUE

# ratio of SEs
summary(lm1)$coef[,2]/summary(lm2)$coef[,2]
## (Intercept)          x1          x2 
##   0.2453172   0.2453172   0.2453172

Why are the SEs from the aggregated model proportionate to those from the individual-level model? And how do I know what their ratio is?

Best Answer

The value of the ratio can be obtained as follows, noting that the discrete sampling in lines 3 and 4 of the code implies 3 options for $X1$, 3 for $X2$, and 7 for $Y$ conditional on each combination of $X1$ and $X2$:

$(N - K)_{old} = 1000 - 3 = 997$

$(N - K)_{new} = (3 \times 3 \times 7) - 3 = 63 - 3 = 60$

Ratio of standard errors = $\sqrt{\frac{(N - K)_{new}}{(N - K)_{old}}} = \sqrt{\frac{60}{997}} = 0.2453172$
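As a quick check of the calculation above, the ratio can be compared against the residual degrees of freedom that R reports for each model. This is only a sketch: it rebuilds the data from the question but aggregates with `aggregate()` rather than the question's `split()` approach, and the names `se_ratio` and `df_ratio` are introduced here for illustration.

```r
# Rebuild the individual-level data from the question
set.seed(1)
n <- 1000
old <- data.frame(x1 = sample(1:3, n, TRUE), x2 = sample(1:3, n, TRUE))
old$y <- old$x1 + old$x2 + sample(-3:3, n, TRUE)

# Aggregate: one row per (x1, x2, y) combination, w = number of rows in it
# (assumption: aggregate() is used here in place of the question's split() code)
old$w <- 1
new <- aggregate(w ~ x1 + x2 + y, data = old, FUN = sum)

lm1 <- lm(y ~ x1 + x2, data = old)
lm2 <- lm(y ~ x1 + x2, data = new, weights = w)

# Ratio of standard errors, coefficient by coefficient
se_ratio <- coef(summary(lm1))[, "Std. Error"] / coef(summary(lm2))[, "Std. Error"]

# Square root of the ratio of residual degrees of freedom, (N - K)_new / (N - K)_old
df_ratio <- sqrt(df.residual(lm2) / df.residual(lm1))

print(se_ratio)
print(df_ratio)
```

If the argument holds, every element of `se_ratio` should agree with `df_ratio` up to floating-point error.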

Why is $(N - K)$ relevant here? Because it is the denominator in the formula for the OLS estimator $s^2$ of the variance parameter $\sigma_0^2$:

$s^2 = \frac{SSR}{N - K}$

where SSR is the sum of squared residuals. This in turn feeds into the estimator of the variance of the coefficient vector $\hat\beta$, with $X$ as the matrix of independent variables:

$V[\hat\beta | X] = \sigma_0^2 (X'X)^{-1}$, estimated by $s^2 (X'X)^{-1}$

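and so into the standard errors of the coefficients. These formulae also apply to the WLS coefficients provided that SSR and $X$ are based on the weighted variables, that is, each observation's variables (including the constant) multiplied by the square root of its weight (WLS being equivalent to OLS on weighted variables).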
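The equivalence between WLS and OLS on weighted variables can be illustrated with a small sketch. The data frame `d` below is simulated purely for illustration (it is not the question's data), and the variance matrix is rebuilt by hand from the formulae above for comparison with what `lm()` reports.

```r
# Hypothetical toy data, simulated only for this illustration
set.seed(42)
d <- data.frame(x = rnorm(20), w = sample(1:5, 20, TRUE))
d$y <- 1 + 2 * d$x + rnorm(20)

# WLS via the weights argument
wls <- lm(y ~ x, data = d, weights = w)

# OLS on variables scaled by sqrt(w); the constant is scaled too,
# so the automatic intercept is dropped and the scaled constant added by hand
d$sw <- sqrt(d$w)
d$ys <- d$sw * d$y
d$xs <- d$sw * d$x
ols <- lm(ys ~ 0 + sw + xs, data = d)

# Variance matrix by hand: s^2 (X'WX)^{-1}, with s^2 = weighted SSR / (N - K)
X  <- model.matrix(wls)
W  <- diag(d$w)
s2 <- sum(d$w * resid(wls)^2) / df.residual(wls)
V  <- s2 * solve(t(X) %*% W %*% X)

print(cbind(wls = coef(wls), ols = coef(ols)))
print(cbind(wls = sqrt(diag(vcov(wls))), by_hand = sqrt(diag(V))))
```

Both the coefficients and the standard errors should match across the two columns in each printed table.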

However, it is important to note (especially for anyone who may be using regression with aggregate data in real applications such as in economics) that the simple ratio formula above only works because of particular features of this case. In general the effect of aggregation on standard errors is more complex, with effects via the SSR and $X'X$ needing to be considered as well as those via $(N - K)$. In this case those two effects happen to drop out, because the weighted SSR and weighted $X'X$ of the aggregated model equal their counterparts in the individual-level model.
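That SSR and $X'X$ play no part in this particular case can be checked directly. The sketch below again rebuilds the question's data (using `aggregate()` as an assumed stand-in for the question's `split()` code) and compares the two quantities between the individual-level and the aggregated, weighted fit; the object names are introduced here for illustration.

```r
# Rebuild the question's data and its aggregated version
set.seed(1)
n <- 1000
old <- data.frame(x1 = sample(1:3, n, TRUE), x2 = sample(1:3, n, TRUE))
old$y <- old$x1 + old$x2 + sample(-3:3, n, TRUE)
old$w <- 1
new <- aggregate(w ~ x1 + x2 + y, data = old, FUN = sum)

lm1 <- lm(y ~ x1 + x2, data = old)
lm2 <- lm(y ~ x1 + x2, data = new, weights = w)

# Sum of squared residuals: unweighted for lm1, weighted for lm2
ssr_old <- sum(resid(lm1)^2)
ssr_new <- sum(new$w * resid(lm2)^2)

# X'X for lm1 and X'WX for lm2 (each row of X scaled by sqrt(w) before crossprod)
xtx_old  <- crossprod(model.matrix(lm1))
xtwx_new <- crossprod(model.matrix(lm2) * sqrt(new$w))

print(c(ssr_old = ssr_old, ssr_new = ssr_new))
print(all.equal(xtx_old, xtwx_new, check.attributes = FALSE))
```

With both quantities equal across the two fits, the only remaining difference in the estimated variance is the divisor $(N - K)$.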

A relevant feature of this case is that aggregation does not group together different $Y$ values, each combination of $X1$, $X2$ and $Y$ forming a separate aggregation. Thus there is no averaging of $Y$ values which would tend to reduce the residuals. Suppose, by contrast, that a sample has two observations of $y$ for each observed $x$ value and that in each case the two observations happen to lie on opposite sides of the fitted line, but the same distance from it. Then regression on the unaggregated data will produce non-zero standard errors of the coefficients, but aggregation at each $x$ value (that is, averaging of its two $y$ observations) will produce a perfect fit with zero residuals and therefore zero standard errors. In that case, therefore, the zero SSR in the aggregate model will dominate any effects via $(N - K)$ and $X'X$.
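The contrasting case can be sketched with a small constructed example. The data below are hypothetical: five $x$ values, each with two $y$ observations placed one unit above and one unit below the line $1 + 2x$, so that the fitted line passes exactly between them.

```r
# Hypothetical data: two y values per x, symmetric about the line 1 + 2x
d <- data.frame(x = rep(1:5, each = 2))
d$y <- 1 + 2 * d$x + rep(c(-1, 1), times = 5)

# Individual-level regression: residuals are +/- 1, so SEs are non-zero
fit_full <- lm(y ~ x, data = d)

# Aggregate by averaging y at each x, weighting by the group size (2)
agg <- aggregate(y ~ x, data = d, FUN = mean)
agg$w <- 2
fit_agg <- lm(y ~ x, data = agg, weights = w)

se_full <- sqrt(diag(vcov(fit_full)))
# R warns about an essentially perfect fit here, which is the point of the example
se_agg <- suppressWarnings(sqrt(diag(vcov(fit_agg))))

print(se_full)
print(se_agg)
```

The coefficients are the same in both fits, but the aggregated model's standard errors should come out numerically zero, since averaging has removed all residual variation.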
