What are the consequences of “copying” a data set for OLS?

intuition, least squares, regression

Suppose I have a random sample $\lbrace X_i, Y_i\rbrace_{i=1}^n$ that satisfies the Gauss-Markov assumptions, so that I can construct the OLS estimator

$$\hat{\beta}_1^{OLS} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}$$
$$\hat{\beta}_0^{OLS} = \bar{Y} - \bar{X} \hat{\beta}_1^{OLS}$$

Now suppose I take my data set and double it, meaning there is an exact copy for each of the $n$ $(X_i,Y_i)$ pairs.

My Question

How does this affect my ability to use OLS? Is it still consistent and identified?

Best Answer

Do you have a good reason to do the doubling (or duplication)? It doesn't make much statistical sense, but it is still interesting to see what happens algebraically. In matrix form your linear model is

$$ \DeclareMathOperator{\V}{\mathbb{V}} Y = X \beta + E, $$

the least squares estimator is $\hat{\beta}_{\text{ols}} = (X^T X)^{-1} X^T Y $ and its variance matrix is $ \V \hat{\beta}_{\text{ols}}= \sigma^2 (X^T X)^{-1} $. "Doubling the data" means that $Y$ is replaced by $\begin{pmatrix} Y \\ Y \end{pmatrix}$ and $X$ is replaced by $\begin{pmatrix} X \\ X \end{pmatrix}$. The ordinary least squares estimator then becomes

$$ \begin{aligned} \left(\begin{pmatrix}X \\ X \end{pmatrix}^T \begin{pmatrix} X \\ X \end{pmatrix} \right )^{-1} \begin{pmatrix} X \\ X \end{pmatrix}^T \begin{pmatrix} Y \\ Y \end{pmatrix} &= (X^T X + X^T X)^{-1} (X^T Y + X^T Y ) \\ &= (2 X^T X)^{-1} \, 2 X^T Y \\ &= \hat{\beta}_{\text{ols}} \end{aligned} $$

so the calculated estimator doesn't change at all. But the calculated variance matrix becomes wrong: applying the usual formula to the doubled design matrix gives, by the same kind of algebra, $\sigma^2 (2 X^T X)^{-1} = \frac{\sigma^2}{2}(X^T X)^{-1}$, half of the correct value. A consequence is that confidence intervals will shrink by a factor of $\frac{1}{\sqrt{2}}$.
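The algebra can be checked numerically. The following is a minimal sketch for the simple regression case, using simulated data; the helper `simple_ols`, the seed, the sample size and the true coefficients are illustrative assumptions, not part of the original post. One detail the formulas above leave out: when $\sigma^2$ is estimated from the residuals rather than known, the reported standard errors shrink by $\sqrt{(n-p)/(2n-p)}$ with $p$ fitted parameters, which is close to $\frac{1}{\sqrt{2}}$ for large $n$.

```python
import math
import random


def simple_ols(x, y):
    """Fit y = b0 + b1*x by OLS; return (b0, b1, se_b1) using the usual iid formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)  # residual variance estimate, p = 2 fitted parameters
    se_b1 = math.sqrt(s2 / sxx)
    return b0, b1, se_b1


random.seed(0)  # simulated data, assumed for illustration only
n = 50
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

b0, b1, se = simple_ols(x, y)
b0_d, b1_d, se_d = simple_ols(x + x, y + y)  # every pair appears twice

print("slope (original, doubled):", b1, b1_d)
print("ratio of reported standard errors:", se_d / se)
print("1/sqrt(2):", 1 / math.sqrt(2))
print("sqrt((n-2)/(2n-2)):", math.sqrt((n - 2) / (2 * n - 2)))
```

The point estimates agree up to floating-point rounding, while the standard error reported for the doubled data is too small.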

The reason is that we have calculated as if the data were still iid, which is untrue: each pair of duplicated values obviously has a correlation equal to $1.0$. If we take this into account and use weighted least squares correctly, we will find the correct variance matrix.
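One way to read the weighted least squares remark, offered here as an assumption rather than as the answer's stated method, is to give each of the $2n$ rows weight $\frac{1}{2}$, so that every original observation carries total weight $1$. The weighted cross-product of the doubled design then equals the original $X^T X$, and the variance formula returns the correct value. The sketch below checks this for the slope in simple regression, with $\sigma^2$ treated as known; the helper `weighted_sxx` and all numbers are hypothetical. If $\sigma^2$ were estimated instead, the residual degrees of freedom would also need to be based on $n$ rather than $2n$, which this sketch leaves out.

```python
import random


def weighted_sxx(x, w):
    """Weighted sum of squared deviations of x about its weighted mean."""
    w_sum = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    return sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))


random.seed(1)  # simulated regressor, assumed for illustration only
sigma2 = 1.0    # error variance, assumed known here
n = 30
x = [random.uniform(0, 10) for _ in range(n)]
x_doubled = x + x

# Slope variance sigma^2 / Sxx under three treatments of the data.
var_original = sigma2 / weighted_sxx(x, [1.0] * n)
var_naive = sigma2 / weighted_sxx(x_doubled, [1.0] * (2 * n))     # copies treated as iid
var_weighted = sigma2 / weighted_sxx(x_doubled, [0.5] * (2 * n))  # each copy gets weight 1/2

print("original:", var_original)
print("naive on doubled data:", var_naive)
print("weighted on doubled data:", var_weighted)
```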

From this, further consequences of the doubling are easy to work out as an exercise; for instance, the value of R-squared will not change.
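The R-squared claim follows because both the residual sum of squares and the total sum of squares double, so their ratio is unchanged. A small check on simulated data, where the helper `r_squared` and the data-generating values are assumptions made for illustration:

```python
import random


def r_squared(x, y):
    """R-squared of the simple OLS fit y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - rss / tss


random.seed(2)  # simulated data, assumed for illustration only
n = 40
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 - 1.5 * xi + random.gauss(0, 2) for xi in x]

r2_original = r_squared(x, y)
r2_doubled = r_squared(x + x, y + y)  # same fit, RSS and TSS both doubled

print("R-squared (original, doubled):", r2_original, r2_doubled)
```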
