Solved – The distinction between a stochastic independent variable and measurement error in an OLS independent variable

assumptions, least-squares, regression, stochastic-processes

Assume an OLS regression of the form:

$$Y_t = X_t'\beta + u_t$$

Suppose the $X_t$ are stochastic, so the standard Gauss-Markov assumptions need to be adapted accordingly. Given that:

$$\text{E}(\hat\beta) = \beta + \text{E}\left[(X'X)^{-1}X'u\right]$$
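(This expression follows from substituting the model into the OLS estimator and taking expectations, where $X$ is the matrix stacking the rows $X_t'$:

$$\hat\beta = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u.)$$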

Now, for OLS to be unbiased we additionally need to assume that the error term and the $X$s are uncorrelated (more precisely, that the error is mean-independent of the regressors: $\text{E}(u \mid X) = 0$).

Problem: Isn't assuming that $X_t$ is random similar to assuming that there is measurement error in $X$ (that is, $X_t = X_t^* + h_t$, where $h_t$ is a white noise process)? This type of random measurement error in fact guarantees that the error term and the independent variables are correlated, and that the beta parameters are biased toward zero. So these are clearly distinct cases; what is the difference? I don't see how a random regressor could fail to be correlated with the error. Would it be correct to assume that, in practice, the assumptions break down and the betas are biased downward whenever random regressors are introduced?

Best Answer

You ask "isn't assuming that $X$ is random similar to assuming that there is measurement error in $X$", in the sense that it biases $\hat{\beta}$?

The answer is no. Here's a little R simulation you can play with to help convince yourself:

get_df <- function(n_obs=10^3, true_beta=c(5, -1, 10, 5)) {
    stopifnot(length(true_beta) == 4)  # Coefficients on constant, x1, x2, x3
    # Regressors are stochastic: drawn fresh for every simulated dataset
    df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs), constant=1)
    df$x3 <- 2*df$x1 + df$x2 + rnorm(n_obs)  # x3 correlated with x1 and x2
    df$y <- as.matrix(df[, c("constant", "x1", "x2", "x3")]) %*% true_beta + df$epsilon
    df$x1_noisy <- df$x1 + rnorm(n_obs, sd=5)  # x1 observed with measurement error
    return(df)
}

get_beta_hat <- function(df, formula=y ~ 1 + x1 + x2 + x3) {
    fit <- lm(formula, data=df)
    return(coefficients(fit))
}

set.seed(543299)

beta_hat <- t(replicate(1000, get_beta_hat(get_df())))
colMeans(beta_hat)  # Very close to true values of (5, -1, 10, 5), even with stochastic X
apply(beta_hat, MARGIN=2, FUN=sd)

beta_hat_noisy_x1 <- t(replicate(1000, get_beta_hat(get_df(), y ~ 1 + x1_noisy + x2 + x3)))
colMeans(beta_hat_noisy_x1)  # I got (5, -0.01, 10.4, 4.6)
apply(beta_hat_noisy_x1, MARGIN=2, FUN=sd)
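
The near-zero coefficient on x1_noisy is the attenuation at work. For a cleaner look at it, here is a minimal single-regressor variant (my own sketch, not part of the original answer): with a unit-variance regressor and measurement noise of sd 5, the classical errors-in-variables result predicts the slope estimate shrinks by a factor of $1/(1+25)$.

# Single-regressor attenuation check (assumed sketch, not from the original answer)
set.seed(123)
n <- 10^5
x <- rnorm(n)                  # true regressor, variance 1
y <- 2*x + rnorm(n)            # true slope is 2
x_noisy <- x + rnorm(n, sd=5)  # measurement noise with variance 25
coef(lm(y ~ x))                # slope estimate close to 2
coef(lm(y ~ x_noisy))          # slope estimate close to 2/26, about 0.077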

The function get_df simulates data where $$Y \equiv X\,\beta + \epsilon,$$ with $X$ and $\epsilon$ (and therefore $Y$) random. Since $$\mathbb{E}\left[\epsilon \mid X\right] = 0,$$ we have $\mathbb{E}\left[\hat{\beta} \mid X\right] = \beta$, i.e. our estimates of $\beta$ are unbiased conditional on any realization of $X$. That implies that they are unbiased unconditionally: $$\mathbb{E}\left[\hat{\beta}\right] = \mathbb{E}\left[\mathbb{E}\left[\hat{\beta} \mid X\right]\right] = \beta.$$

The issue with measurement error, the reason it's different from randomness in $X$, is that the measurement noise does not enter the definition of $Y$. In other words, if you regress $Y$ on noisy $X$, you are fitting the wrong model: the true model regresses $Y$ on the non-noisy (but possibly random) $X$.
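
To see why, the classical single-regressor errors-in-variables derivation (a standard textbook result, added here for completeness) makes the mechanism explicit. Suppose the true model is $Y = \beta X^* + u$ but we observe only $X = X^* + h$, with $h$ white noise independent of $X^*$ and $u$. Substituting $X^* = X - h$ gives $$Y = \beta X + (u - \beta h),$$ so the composite error $u - \beta h$ is correlated with the regressor $X$ through $h$, and $$\operatorname{plim}\,\hat{\beta} = \beta\,\frac{\sigma^2_{X^*}}{\sigma^2_{X^*} + \sigma^2_h},$$ which is attenuated toward zero, exactly the "downward" bias the question describes.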
