Solved – The distinction between a stochastic independent variable and measurement error in an OLS independent variable

assumptions, least-squares, regression, stochastic-processes

Assume an OLS regression of the form:

$$Y_t = X_t'\beta + u_t$$

Suppose the $X_t$ are stochastic, so the standard Gauss-Markov assumptions need to be adapted accordingly. Given that:

$$\text{E}(\hat\beta) = \beta + \text{E}\left[(X'X)^{-1}X'u\right]$$
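(This expression follows from substituting the model into the OLS estimator and taking expectations, where $X$ is the matrix stacking the rows $X_t'$:

$$\hat\beta = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u.)$$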

Now, for OLS to be unbiased we additionally need to assume that the error term and the $X$s are uncorrelated (more precisely, that the error is mean-independent of the regressors: $\text{E}(u \mid X) = 0$).

Problem: Isn't assuming that $X_t$ is random similar to assuming that there is measurement error in $X$ (that is, $X_t = X_t^* + h_t$, where $h_t$ is a white noise process)? This type of random measurement error in fact guarantees that the error term and the independent variables are correlated, and that the beta parameters are biased toward zero. So these are clearly distinct cases; what is the difference? I don't see how a random regressor could fail to be correlated with the error. Would it be correct to assume that, in practice, the assumptions break down and the betas are biased downward whenever random regressors are introduced?

Best Answer

You ask "isn't assuming that $X$ is random similar to assuming that there is measurement error in $X$", in the sense that it biases $\hat{\beta}$?

The answer is no. Here's a little R simulation you can play with to help convince yourself:

get_df <- function(n_obs=10^3, true_beta=c(5, -1, 10, 5)) {
    stopifnot(length(true_beta) == 4)  # Coefficients on constant, x1, x2, x3
    # Regressors are stochastic: drawn fresh for every simulated dataset
    df <- data.frame(x1=rnorm(n_obs), x2=rnorm(n_obs), epsilon=rnorm(n_obs), constant=1)
    df$x3 <- 2*df$x1 + df$x2 + rnorm(n_obs)  # x3 correlated with x1 and x2
    df$y <- as.matrix(df[, c("constant", "x1", "x2", "x3")]) %*% true_beta + df$epsilon
    df$x1_noisy <- df$x1 + rnorm(n_obs, sd=5)  # x1 observed with measurement error
    return(df)
}

get_beta_hat <- function(df, formula=y ~ 1 + x1 + x2 + x3) {
    fit <- lm(formula, data=df)
    return(coefficients(fit))
}

set.seed(543299)

beta_hat <- t(replicate(1000, get_beta_hat(get_df())))
colMeans(beta_hat)  # Very close to true values of (5, -1, 10, 5), even with stochastic X
apply(beta_hat, MARGIN=2, FUN=sd)

beta_hat_noisy_x1 <- t(replicate(1000, get_beta_hat(get_df(), y ~ 1 + x1_noisy + x2 + x3)))
colMeans(beta_hat_noisy_x1)  # I got (5, -0.01, 10.4, 4.6)
apply(beta_hat_noisy_x1, MARGIN=2, FUN=sd)
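
The near-zero coefficient on x1_noisy is the attenuation at work. For a cleaner look at it, here is a minimal single-regressor variant (my own sketch, not part of the original answer): with a unit-variance regressor and measurement noise of sd 5, the classical errors-in-variables result predicts the slope estimate shrinks by a factor of $1/(1+25)$.

# Single-regressor attenuation check (assumed sketch, not from the original answer)
set.seed(123)
n <- 10^5
x <- rnorm(n)                  # true regressor, variance 1
y <- 2*x + rnorm(n)            # true slope is 2
x_noisy <- x + rnorm(n, sd=5)  # measurement noise with variance 25
coef(lm(y ~ x))                # slope estimate close to 2
coef(lm(y ~ x_noisy))          # slope estimate close to 2/26, about 0.077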

The function get_df simulates data where $$Y \equiv X\,\beta + \epsilon,$$ with $X$ and $\epsilon$ (and therefore $Y$) random. Since $$\mathbb{E}\left[\epsilon \mid X\right] = 0,$$ we have $\mathbb{E}\left[\hat{\beta} \mid X\right] = \beta$, i.e. our estimates of $\beta$ are unbiased conditional on any realization of $X$. That implies that they are unbiased unconditionally: $$\mathbb{E}\left[\hat{\beta}\right] = \mathbb{E}\left[\mathbb{E}\left[\hat{\beta} \mid X\right]\right] = \beta.$$

The issue with measurement error, the reason it's different from randomness in $X$, is that the measurement noise does not enter the definition of $Y$. In other words, if you regress $Y$ on noisy $X$, you are fitting the wrong model: the true model regresses $Y$ on the non-noisy (but possibly random) $X$.
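
To see why, the classical single-regressor errors-in-variables derivation (a standard textbook result, added here for completeness) makes the mechanism explicit. Suppose the true model is $Y = \beta X^* + u$ but we observe only $X = X^* + h$, with $h$ white noise independent of $X^*$ and $u$. Substituting $X^* = X - h$ gives $$Y = \beta X + (u - \beta h),$$ so the composite error $u - \beta h$ is correlated with the regressor $X$ through $h$, and $$\operatorname{plim}\,\hat{\beta} = \beta\,\frac{\sigma^2_{X^*}}{\sigma^2_{X^*} + \sigma^2_h},$$ which is attenuated toward zero, exactly the "downward" bias the question describes.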
