Solved – Why take multiple samples from a population

econometrics, population, sample, sample-size, sampling

I'm fairly new to econometrics, and I'm curious: when trying to estimate a population parameter, why is it important to take multiple samples from the population instead of just combining all of those samples into one large sample? E.g., if you were trying to estimate the effect of one extra year of education on wages, and your population is college students, why take 10 samples of 1,000 students each instead of one sample of 10,000 students? Thanks. (Sorry if this has been answered already; I couldn't find similar questions after doing a search.)

Best Answer

Let's present a case where it is preferable to use the pooled-data estimator. This is related to an old question of mine, so some things are repeated.

Pooling the available data of total size $n$, as against keeping it separate and working with $m$ smaller samples each of size $n_j = n/m$, has an advantage in the context of OLS regression as regards the variance of the OLS estimator, at least when the $m$ samples are considered independent.

Consider a linear regression setting (to which the OP alludes), and let's say that all nice properties of the OLS estimator do hold: unbiasedness, efficiency, consistency.

Let's say we have two options: take one sample of size $n$ or $m$ samples each of size $n_j = n/m$. If we take one sample, we will run one regression,

$$\mathbf y = \mathbf X\beta + \mathbf u,\;\; {\rm Var}(\mathbf u\mid \mathbf X) =\sigma^2\mathbf I$$

and we will obtain a single estimate, $\hat \beta$.

If we take $m$ independent samples, we will run $m$ independent regressions,

$$\mathbf y_j = \mathbf X_j\beta + \mathbf u_j, \;\;{\rm Var}(\mathbf u_j \mid \mathbf X_j) =\sigma^2\mathbf I,\;\; j=1,\dots,m$$

and we will get $m$ estimates $b_j,\; j=1,\dots,m$. This looks better: we can obtain an empirical distribution rather than just a single estimate. But what if we eventually want a single estimate from this route as well? The most natural thing would be to average over the lot, arriving at the averaging estimator

$$\bar b = \frac 1m\sum_{j=1}^{m}b_j$$
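For concreteness, here is a minimal numpy sketch of the two routes on simulated data; the design (an intercept plus one regressor), the true $\beta = (1, 0.5)'$, and the unit error variance are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_j = 10, 1000                  # m samples of size n_j each, n = m * n_j
beta = np.array([1.0, 0.5])        # hypothetical true coefficients

# m independent samples: intercept plus one regressor, u_j ~ N(0, 1)
X_list = [np.column_stack([np.ones(n_j), rng.normal(size=n_j)]) for _ in range(m)]
y_list = [Xj @ beta + rng.normal(size=n_j) for Xj in X_list]

# Route 1: pool everything and run a single regression -> beta_hat
X, y = np.vstack(X_list), np.concatenate(y_list)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Route 2: run m separate regressions and average the estimates -> b_bar
b = [np.linalg.lstsq(Xj, yj, rcond=None)[0] for Xj, yj in zip(X_list, y_list)]
b_bar = np.mean(b, axis=0)
```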

All estimators involved in both approaches are unbiased and consistent. But informally, in the first case we appeal to the consistency property of the estimator: by using a large sample, we hope that the asymptotic properties will kick in and improve the accuracy of the estimate we obtain.

In the second case, we appeal to the unbiasedness property of the estimators: we obtain many estimates and take their average, thereby estimating the expected value of the estimator, which equals the true value.

What happens is that
$${\rm Var}(\hat \beta \mid \mathbf X) < {\rm Var}(\bar b \mid \mathbf X)$$

in the matrix sense, i.e. the difference ${\rm Var}(\bar b \mid \mathbf X) -{\rm Var}(\hat \beta \mid \mathbf X)$ is a positive definite matrix: the estimator from the single pooled sample has lower variance than the averaging estimator.
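Before the derivation, here is a small Monte Carlo sketch of this inequality; the sample sizes, true $\beta$, and unit error variance are all made-up choices:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_j, reps = 5, 50, 20_000
beta = np.array([1.0, 0.5])        # hypothetical true coefficients

# Fix the designs once: both variances above are conditional on X
X_list = [np.column_stack([np.ones(n_j), rng.normal(size=n_j)]) for _ in range(m)]
X = np.vstack(X_list)
P = np.linalg.pinv(X)                           # (X'X)^{-1} X', pooled estimator
P_list = [np.linalg.pinv(Xj) for Xj in X_list]  # per-sample OLS maps

hat = np.empty((reps, 2))
bar = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(size=m * n_j)     # fresh errors each draw, sigma^2 = 1
    hat[r] = P @ y                              # pooled estimate
    bar[r] = np.mean([P_list[j] @ y[j * n_j:(j + 1) * n_j] for j in range(m)],
                     axis=0)                    # averaging estimate

# Var(b_bar | X) - Var(beta_hat | X): its eigenvalues should come out
# positive, up to simulation noise
diff = np.cov(bar, rowvar=False) - np.cov(hat, rowvar=False)
print(np.linalg.eigvalsh(diff))
```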

Consider first

$${\rm Var}(\bar b \mid \mathbf X) = {\rm Var}\left(\frac 1m\sum_{j=1}^{m}b_j\mid \mathbf X\right) = \frac 1{m^2}\sum_{j=1}^{m}{\rm Var}(b_j\mid \mathbf X) = \sigma^2\frac 1{m^2}\sum_{j=1}^{m}(\mathbf X_j' \mathbf X_j)^{-1}$$

No covariances appear because the samples are assumed independent. Using the symbol $A$ to denote the arithmetic mean, we can write

$${\rm Var}(\bar b\mid \mathbf X) = \frac {\sigma^2}{m}\cdot A\left[\left(\mathbf X_j' \mathbf X_j\right)^{-1};j=1,...,m\right] \tag{1}$$

For the single-sample estimator we have

$${\rm Var}(\hat \beta\mid \mathbf X) = \sigma^2 (\mathbf X' \mathbf X)^{-1}$$

Now, $\mathbf X$ is a matrix that stacks the $\mathbf X_j$ matrices of the $m$ samples,

$$\mathbf X = \left[ \begin{matrix} \mathbf X_1 \\ \mathbf X_2 \\ \vdots \\ \mathbf X_m \end{matrix}\right]$$

Therefore,

$$\mathbf X' \mathbf X = \mathbf X_1' \mathbf X_1 + \mathbf X_2' \mathbf X_2 +...+\mathbf X_m' \mathbf X_m$$
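This block identity is easy to confirm numerically; a quick sketch with arbitrary made-up design matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
# four arbitrary design matrices (made-up sizes), stacked as above
X_list = [rng.normal(size=(100, 3)) for _ in range(4)]
X = np.vstack(X_list)

lhs = X.T @ X
rhs = sum(Xj.T @ Xj for Xj in X_list)
print(np.allclose(lhs, rhs))   # True: X'X is the sum of the X_j'X_j
```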

Manipulate the sum into

$$\mathbf X' \mathbf X = \left[\left(\mathbf X_1' \mathbf X_1\right)^{-1}\right]^{-1} + \left[\left(\mathbf X_2' \mathbf X_2\right)^{-1}\right]^{-1} +...+\left[\left(\mathbf X_m' \mathbf X_m\right)^{-1}\right]^{-1}$$

$$\implies \left(\mathbf X' \mathbf X\right)^{-1} = \left(\left[\left(\mathbf X_1' \mathbf X_1\right)^{-1}\right]^{-1} + \left[\left(\mathbf X_2' \mathbf X_2\right)^{-1}\right]^{-1} +...+\left[\left(\mathbf X_m' \mathbf X_m\right)^{-1}\right]^{-1}\right)^{-1}$$

After our eyes adjust to the three layers of inverses, we can see that the right-hand side is the harmonic mean of the $\left(\mathbf X_j' \mathbf X_j\right)^{-1}$ matrices, scaled by $1/m$. Note that this is the harmonic mean in the matrix sense of the term.

Using the symbol $H$ to denote the harmonic mean we have

$${\rm Var}(\hat \beta\mid \mathbf X) = \sigma^2 (\mathbf X' \mathbf X)^{-1} = \frac {\sigma^2}{m}\cdot H\left[\left(\mathbf X_j' \mathbf X_j\right)^{-1};j=1,...,m\right] \tag{2}$$
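Both $(2)$ and the comparison that follows can be verified numerically; a sketch with arbitrary simulated designs (sizes made up):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5
# arbitrary design matrices, 50 rows and 2 columns each (made-up sizes)
X_list = [rng.normal(size=(50, 2)) for _ in range(m)]
V = [np.linalg.inv(Xj.T @ Xj) for Xj in X_list]   # the (X_j'X_j)^{-1} matrices

A = sum(V) / m                                              # arithmetic mean
H = m * np.linalg.inv(sum(np.linalg.inv(Vj) for Vj in V))   # matrix harmonic mean

# eq. (2): (X'X)^{-1} for the stacked X equals H / m
X = np.vstack(X_list)
print(np.allclose(np.linalg.inv(X.T @ X), H / m))  # True

# A - H should be positive definite (the ordering used below)
print(np.linalg.eigvalsh(A - H))                   # all eigenvalues positive
```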

For scalars, it is well known that $H \le A$, with equality only when all the elements are equal. It has been proven that the same ordering holds for positive definite matrices too. So we have shown that

$${\rm Var}(\hat \beta\mid \mathbf X) < {\rm Var}(\bar b \mid \mathbf X)$$ again in the matrix sense.

So in this setup, we prefer to pool the data because it lowers the variance of the estimator.

Note: "pooling the data" does not always produce this beneficial effect. For example, if we want to estimate the mean of a population, the variance of the sample mean from a single-sample of size $n$ is the same as the variance of the average of $m$ sample means from $m$ samples each of size $n/m$.
