I'm fairly new to econometrics, and I'm curious: when trying to estimate a population parameter, why is it important to take multiple samples from the population instead of just combining all of those samples into one large sample? E.g., if you were trying to estimate the effect of one extra year of education on wages, and your population is college students, why take 10 samples of 1,000 students each instead of just taking 1 sample of 10,000 students? Thanks. (Sorry if this has been answered already; I couldn't find similar questions after doing a search.)
Solved – Why take multiple samples from a population
econometrics, population, sample, sample-size, sampling
Related Solutions
It seems like you're imagining a very simple sampling model.
The simplest model for sampling is aptly called Simple Random Sampling. You select a subset of the population (e.g., by dialing phone numbers at random) and ask whoever answers how they're voting. If 487 say Clinton, 463 say Trump, and the remainder give you some wacky answer, then the polling firm would report that 49% of voters prefer Clinton, while 46% prefer Trump. However, polling firms do a lot more than this. A simple random sample gives equal weight to every data point. But suppose your sample contains--by chance--600 men and 400 women, which clearly isn't representative of the population as a whole. If men as a group lean one way while women lean the other, this will bias your result. Since we have pretty good demographic statistics, you can weight* the responses, counting the women's responses a bit more and the men's a bit less, so that the weighted response represents the population better. Polling organizations have more complicated weighting models that can make a non-representative sample resemble a more representative one.
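To make the reweighting concrete, here is a toy sketch of post-stratification weighting. All numbers (group sizes, support fractions, the 50/50 population split) are invented for illustration:

```python
# Hypothetical illustration of post-stratification weighting: a sample with
# 600 men and 400 women is reweighted to match a 50/50 population split.
# All numbers are made up for the example.

# Observed sample, by group (fraction preferring candidate A)
sample = {
    "men":   {"n": 600, "support_a": 0.52},
    "women": {"n": 400, "support_a": 0.44},
}
population_share = {"men": 0.50, "women": 0.50}

n_total = sum(g["n"] for g in sample.values())

# Unweighted estimate: every respondent counts equally, so the
# over-represented group (men) pulls the estimate toward its answer
unweighted = sum(g["n"] * g["support_a"] for g in sample.values()) / n_total

# Weighted estimate: each group's response is scaled to its population share
weighted = sum(population_share[k] * g["support_a"] for k, g in sample.items())

print(f"unweighted: {unweighted:.3f}")  # 0.488 (men over-counted)
print(f"weighted:   {weighted:.3f}")    # 0.480
```

The weighted figure is what the firm would report: it is the answer the sample implies about a population that is 50% men and 50% women, not about the lopsided sample itself.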
The idea of weighting the sampled responses is on pretty firm statistical ground, but there is some flexibility in choosing what factors contribute to the weights. Most pollsters do reweight based on demographic factors like gender, age, and race. Given this, you might think that party identification (Democratic, Republican, etc) should also be included, but it turns out that most polling firms do not use it in their weights: party (self)-identification is tangled up with the voter's choice in a way that makes it less useful.
Many polling outfits also report their results among "likely voters". In these, respondents are either selected or weighted based on the likelihood that they'll actually turn up to the polls. This model is undoubtedly data-driven too, but the precise choice of factors allows for some flexibility. For example, including interactions between the candidate and voter's race (or gender) wasn't even sensible until 2008 or 2016, but I suspect they have some predictive power now.
In theory, you could include all sorts of things as weighting factors: musical preference, eye color, etc. However, demographic factors are popular choices for weighting factors because:
- Empirically, they correlate well with voter behavior. Obviously, there is no iron-clad law that 'forces' white men to lean Republican, but over the last fifty years, they have tended to.
- The population values are well known (e.g., from the census or vital records).
However, pollsters also see the same news everyone else does, and can adjust the weighting variables if necessary.
There are also some "fudge factors" that are sometimes invoked to explain poll results. For example, respondents sometimes are reluctant to give "socially-undesirable" answers. The Bradley Effect posits that white voters sometimes downplay their support for white candidates running against a minority to avoid appearing racist. It is named after Tom Bradley, an African-American gubernatorial candidate who narrowly lost the election despite leading comfortably in the polls.
Finally, you're completely correct that the very act of asking someone's opinion can change it. Polling firms try to write their questions in a neutral way. To avoid issues with the order of possible responses, the candidates' names might be listed in random order. Multiple versions of a question are also sometimes tested against each other. This effect can also be exploited for nefarious ends in a push poll, where the interviewer isn't actually interested in collecting responses but in influencing them. For example, a push poll might ask "Would you vote for [Candidate A] even if it was reported that he was a child molester?".
* You might also set explicit targets for your sample, like including 500 men and 500 women. This is called stratified sampling--the population is stratified into different groups, and each group is then sampled at random. In practice, this isn't done very often for polls, because you'd need to stratify into a large number of narrowly defined groups (e.g., college-educated men between 18-24 in urban Texas).
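The distinction from simple random sampling can be sketched in a few lines: with stratification, the group sizes in the sample are fixed by design rather than left to chance. The population here is entirely made up (60% men, 40% women, labels only):

```python
import random

random.seed(0)

# Hypothetical population: 60% men, 40% women (labels only, for illustration)
population = [("man", i) for i in range(6000)] + [("woman", i) for i in range(4000)]

# Stratified sampling: partition the population into strata, then draw a
# fixed-size simple random sample within each stratum (500 per group here)
strata = {
    "man":   [p for p in population if p[0] == "man"],
    "woman": [p for p in population if p[0] == "woman"],
}
sample = [x for members in strata.values() for x in random.sample(members, 500)]

counts = {"man": 0, "woman": 0}
for sex, _ in sample:
    counts[sex] += 1
print(counts)  # exactly 500 of each, by construction
```

A simple random sample of 1,000 from this population would instead contain around 600 men on average, with random fluctuation around that; stratification removes that source of variability entirely.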
Best Answer
Let's present a case where it is preferable to use the pooled-data estimator. This is related to an old question of mine so some things are repeated.
Pooling the available data of total size $n$, as opposed to keeping it separate and working with $m$ smaller samples each of size $n_j = n/m$, has an advantage in the context of OLS regression as regards the variance of the OLS estimator, at least when the $m$ samples are considered independent.
Consider a linear regression setting (to which the OP alludes), and let's say that all nice properties of the OLS estimator do hold: unbiasedness, efficiency, consistency.
Let's say we have two options: take one sample of size $n$ or $m$ samples each of size $n_j = n/m$. If we take one sample, we will run one regression,
$$\mathbf y = \mathbf X\beta + \mathbf u,\;\; {\rm Var}(\mathbf u\mid X) =\sigma^2\mathbf I$$
and we will obtain a single estimate, $\hat \beta$.
If we take $m$ independent samples we will run $m$ independent regressions,
$$\mathbf y_j = \mathbf X_j\beta + \mathbf u_j, \;\;{\rm Var}(\mathbf u_j \mid X_j) =\sigma^2\mathbf I,\;\; j=1,...m$$
and we will get $m$ estimates $b_j,\; j=1,...,m$. This looks better: we can obtain an empirical distribution rather than just a single estimate. But what if we eventually want to obtain a single estimate from this route as well? The most natural thing would be to average over the lot, arriving at the averaging estimator
$$\bar b = \frac 1m\sum_{j=1}^{m}b_j$$
All estimators involved in both approaches are unbiased and consistent. But informally, in the first case we appeal to the consistency property of the estimator: by using a large sample, we hope that asymptotic properties will benefit the accuracy of the estimate we obtain.
In the second case, we appeal to the unbiasedness property of the estimators: we obtain many estimates and take their average, thereby estimating the expected value of the estimator, which equals the true value.
What happens is that
$${\rm Var}(\hat \beta \mid \mathbf X) < {\rm Var}(\bar b \mid \mathbf X)$$
in the matrix sense, i.e. that the difference ${\rm Var}(\bar b \mid \mathbf X) -{\rm Var}(\hat \beta \mid \mathbf X)$ is a positive definite matrix: The estimator from the single sample has lower variance than the averaging estimator.
Consider first
$${\rm Var}(\bar b \mid \mathbf X) = {\rm Var}\left(\frac 1m\sum_{j=1}^{m}b_j\mid \mathbf X\right) = \frac 1{m^2}\sum_{j=1}^{m}{\rm Var}(b_j\mid \mathbf X) = \sigma^2\frac 1{m^2}\sum_{j=1}^{m}(\mathbf X_j' \mathbf X_j)^{-1}$$
No covariances appear because the samples are assumed independent. Using the symbol $A$ to denote the arithmetic mean, we can write
$${\rm Var}(\bar b\mid \mathbf X) = \frac {\sigma^2}{m}\cdot A\left[\left(\mathbf X_j' \mathbf X_j\right)^{-1};j=1,...,m\right] \tag{1}$$
For the single-sample estimator we have
$${\rm Var}(\hat \beta\mid \mathbf X) = \sigma^2 (\mathbf X' \mathbf X)^{-1}$$
Now, $\mathbf X$ is a matrix that stacks the $\mathbf X_j$ matrices of the $m$ samples,
$$\mathbf X = \left [ \begin{matrix} \mathbf X_1 \\ \mathbf X_2\\ . \\ . \\ \mathbf X_m \end{matrix}\right]$$
Therefore,
$$\mathbf X' \mathbf X = \mathbf X_1' \mathbf X_1 + \mathbf X_2' \mathbf X_2 +...+\mathbf X_m' \mathbf X_m$$
Manipulate this into
$$\mathbf X' \mathbf X = \left[\left(\mathbf X_1' \mathbf X_1\right)^{-1}\right]^{-1} + \left[\left(\mathbf X_2' \mathbf X_2\right)^{-1}\right]^{-1} +...+\left[\left(\mathbf X_m' \mathbf X_m\right)^{-1}\right]^{-1}$$
$$\implies \left(\mathbf X' \mathbf X\right)^{-1} = \left(\left[\left(\mathbf X_1' \mathbf X_1\right)^{-1}\right]^{-1} + \left[\left(\mathbf X_2' \mathbf X_2\right)^{-1}\right]^{-1} +...+\left[\left(\mathbf X_m' \mathbf X_m\right)^{-1}\right]^{-1}\right)^{-1}$$
After our eyes adjust to the three layers of inverses, we can see that the right hand side is the scaled harmonic mean of the $\left(\mathbf X_j' \mathbf X_j\right)^{-1}$ matrices. Note that this is the harmonic mean in the matrix sense of the term.
Using the symbol $H$ to denote the harmonic mean we have
$${\rm Var}(\hat \beta\mid \mathbf X) = \sigma^2 (\mathbf X' \mathbf X)^{-1} = \frac {\sigma^2}{m}\cdot H\left[\left(\mathbf X_j' \mathbf X_j\right)^{-1};j=1,...,m\right] \tag{2}$$
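Identity $(2)$ can be checked numerically. The sketch below uses made-up Gaussian design matrices (sizes $m$, $n_j$, $k$ chosen arbitrarily) and confirms that $(\mathbf X' \mathbf X)^{-1}$ equals $\frac 1m H$ of the per-sample $(\mathbf X_j' \mathbf X_j)^{-1}$ matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

m, nj, k = 5, 40, 3  # m samples of size nj each, k regressors (arbitrary)
Xs = [rng.normal(size=(nj, k)) for _ in range(m)]
X = np.vstack(Xs)  # stacked design matrix of the pooled sample

# Left side: (X'X)^{-1} for the pooled sample
lhs = np.linalg.inv(X.T @ X)

# Right side: (1/m) times the matrix harmonic mean of the (Xj'Xj)^{-1},
# i.e. H = m * (sum of the inverses of the (Xj'Xj)^{-1}) ^ {-1}
inv_blocks = [np.linalg.inv(Xj.T @ Xj) for Xj in Xs]
H = m * np.linalg.inv(sum(np.linalg.inv(B) for B in inv_blocks))
rhs = H / m

assert np.allclose(lhs, rhs)  # identity (2), up to floating point
```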
For scalars, it is well known that $H \le A$, with equality only when all the elements are equal. The analogous result has been proven for matrices as well. So we have proven that
$${\rm Var}(\hat \beta\mid \mathbf X) < {\rm Var}(\bar b \mid \mathbf X)$$ again in the matrix sense.
So in this set up, we prefer to pool the data because it lowers the variance.
Note: "pooling the data" does not always produce this beneficial effect. For example, if we want to estimate the mean of a population, the variance of the sample mean from a single-sample of size $n$ is the same as the variance of the average of $m$ sample means from $m$ samples each of size $n/m$.
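The sample-mean case is easy to see in code: with equal subsample sizes, the pooled mean and the average of the $m$ subsample means are not just equal in variance, they are the same number (the data here are arbitrary Gaussian draws):

```python
import numpy as np

rng = np.random.default_rng(1)

m, nj = 5, 20  # m subsamples of size nj each (arbitrary)
samples = [rng.normal(loc=3.0, size=nj) for _ in range(m)]

# Pooled estimator: mean of all n = m * nj observations at once
pooled_mean = np.concatenate(samples).mean()

# Averaging estimator: mean of the m subsample means
avg_of_means = np.mean([s.mean() for s in samples])

# With equal subsample sizes the two estimators coincide exactly,
# so pooling cannot lower the variance here
assert np.isclose(pooled_mean, avg_of_means)
```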