This is a brief question: my lecturer mentioned it in class today, but I don't quite understand it. Why is omitted variable bias not a major problem in time series analysis?
Solved – Omitted variable bias in time series
Related Solutions
Econometric models often try to determine causality in a time series context. For example, studying the impact of tax rate changes on economic growth.
Panel data, a time series of cross-section data sets, is often used to estimate causal effects. Difference-in-differences estimators are commonly employed here for precisely this exercise (essentially, time-invariant omitted variables are differenced out).
To answer your direct question, yes, we would be concerned about omitted variables when we are trying to determine causal links. If these variables are correlated with our treatment variable, then we can get a biased estimate of the causal effect.
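A small simulation makes this concrete. This is only a sketch with made-up coefficients (there is no real data here, and plain NumPy least squares stands in for any econometrics package): a confounder $Z$ drives both the treatment $X$ and the outcome $Y$, and omitting it biases the estimated effect of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: z confounds the x -> y relationship.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)            # treatment, correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # true effect of x on y is 1.0

def ols(X, y):
    """Least-squares coefficients for y ~ X (no intercept needed: all means are 0)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(np.column_stack([x, z]), y)  # both regressors included
b_short = ols(x[:, None], y)              # z omitted

print(b_full[0])   # close to the true 1.0
print(b_short[0])  # biased upward: roughly 1.0 + 2.0 * Cov(x,z)/Var(x)
```

The short regression's coefficient absorbs part of $Z$'s effect because $X$ proxies for $Z$, which is exactly the bias described above.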
For prediction, we aren't looking to ascribe causal links, so omitted variable bias may be less of a concern. Knowing that the number of people carrying umbrellas is a good predictor of whether it will rain in the afternoon is useful enough for me to forecast the weather, even though I can't explain why.
The distinction between explaining and predicting becomes key when the omitted factor changes over time. If people decide that they don't want to be exposed to the sun and start carrying umbrellas on rainy and sunny days alike, then my forecasting ability breaks down. Without a causal factor, I can't anticipate these model failures. This is how omitted variable bias can be important even "just" for forecasting.
Given that the model doesn't break down, I might be able to generate better predictions using non-causal factors in addition to causal ones. Whatever I can get my hands on to reduce my prediction variance.
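The umbrella story can be simulated with made-up probabilities (all the numbers below are hypothetical): in the first regime, umbrellas closely track rain, so the naive rule "predict rain iff umbrellas are out" works well; once people start carrying umbrellas for sun protection too, the same rule degrades.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Regime 1: umbrellas are carried almost only when it rains.
rain1 = rng.binomial(1, 0.3, n)
umb1 = rain1 | rng.binomial(1, 0.05, n)  # small baseline umbrella use

# Regime 2: umbrellas are also carried for sun protection.
rain2 = rng.binomial(1, 0.3, n)
umb2 = rain2 | rng.binomial(1, 0.6, n)   # umbrellas now common regardless of rain

def accuracy(umbrellas, rain):
    """Accuracy of the naive rule: predict rain iff umbrellas are out."""
    return np.mean(umbrellas == rain)

acc1 = accuracy(umb1, rain1)
acc2 = accuracy(umb2, rain2)
print(acc1)  # high: the non-causal predictor works in regime 1
print(acc2)  # noticeably lower once the omitted behaviour shifts
```

Nothing about the forecasting rule changed between regimes; only the omitted causal structure did, which is exactly why such failures can't be anticipated from within the model.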
The main issue here is the nature of the omitted variable bias. Wikipedia states:
Two conditions must hold true for omitted-variable bias to exist in linear regression:
- the omitted variable must be a determinant of the dependent variable (i.e., its true regression coefficient is not zero); and
- the omitted variable must be correlated with one or more of the included independent variables (i.e., $\operatorname{Cov}(Z, X) \neq 0$).
It's important to carefully note the second criterion. Your betas will only be biased under certain circumstances. Specifically, if there are two variables that contribute to the response that are correlated with each other, but you only include one of them, then (in essence) the effects of both will be attributed to the included variable, causing bias in the estimation of that parameter. So perhaps only some of your betas are biased, not necessarily all of them.
Another disturbing possibility is that if your sample is not representative of the population (which it rarely really is), and you omit a relevant variable, even if it's uncorrelated with the other variables, this could cause a vertical shift which biases your estimate of the intercept. For example, imagine a variable, $Z$, increases the level of the response, and that your sample is drawn from the upper half of the $Z$ distribution, but $Z$ is not included in your model. Then, your estimate of the population mean response (and the intercept) will be biased high despite the fact that $Z$ is uncorrelated with the other variables. Additionally, there is the possibility that there is an interaction between $Z$ and variables in your model. This can also cause bias without $Z$ being correlated with your variables (I discuss this idea in my answer here.)
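The intercept-shift scenario can be sketched the same way (hypothetical numbers throughout): sample only from the upper half of the $Z$ distribution, omit $Z$, and the intercept estimate is pushed up while the slope on the uncorrelated regressor is unaffected.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

x = rng.normal(size=n)
z = rng.normal(size=n)  # uncorrelated with x, but raises the response
y = 3.0 + 1.0 * x + 2.0 * z + rng.normal(size=n)

# Sample only from the upper half of the Z distribution, then omit z.
keep = z > 0
X = np.column_stack([np.ones(keep.sum()), x[keep]])
b0, b1 = np.linalg.lstsq(X, y[keep], rcond=None)[0]

print(b0)  # biased above the true intercept 3.0, by about 2 * E[z | z > 0]
print(b1)  # slope on x remains close to 1.0
```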
Now, given that in the world's equilibrium state everything is ultimately correlated with everything else, we might find this all very troubling. Indeed, when doing observational research, it is best to always assume that every variable is endogenous.
There are, however, limits to this (cf. Cornfield's inequality). First, conducting true experiments breaks the correlation between a focal variable (the treatment) and any otherwise relevant, but unobserved, explanatory variables. Second, some statistical techniques can be used with observational data to account for such unobserved confounds (prototypically, instrumental variables regression, but others as well).
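To illustrate the instrumental-variables idea with a toy simulation (all coefficients invented; the Wald ratio stands in for a full 2SLS routine): an instrument $W$ shifts $X$ but is unrelated to the confounder $Z$, so the ratio $\operatorname{Cov}(W,Y)/\operatorname{Cov}(W,X)$ recovers the causal effect that OLS misses.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

z = rng.normal(size=n)  # unobserved confounder
w = rng.normal(size=n)  # instrument: moves x, has no direct path to y
x = 0.8 * z + 1.0 * w + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # true causal effect of x is 1.0

# Naive OLS slope is contaminated by z; the IV (Wald) estimator is not.
b_ols = np.cov(x, y)[0, 1] / np.var(x)
b_iv = np.cov(w, y)[0, 1] / np.cov(w, x)[0, 1]

print(b_ols)  # biased above 1.0
print(b_iv)   # close to the true 1.0
```

The instrument works precisely because it is correlated with the treatment but, by construction, uncorrelated with the omitted confounder.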
Setting these possibilities aside (they probably do represent a minority of modeling approaches), what is the long-run prospect for science? This depends on the magnitude of the bias, and the volume of exploratory research that gets done. Even if the numbers are somewhat off, they may often be in the neighborhood, and sufficiently close that relationships can be discovered. Then, in the long run, researchers can become clearer on which variables are relevant. Indeed, modelers sometimes explicitly trade off increased bias for decreased variance in the sampling distributions of their parameters (cf. my answer here). In the short run, it's worth always remembering the famous quote from Box:
All models are wrong, but some are useful.
There is also a potentially deeper philosophical question here: What does it mean that the estimate is being biased? What is supposed to be the 'correct' answer? If you gather some observational data about the association between two variables (call them $X$ & $Y$), what you are getting is ultimately the marginal correlation between those two variables. This is only the 'wrong' number if you think you are doing something else, and getting the direct association instead. Likewise, in a study to develop a predictive model, what you care about is whether, in the future, you will be able to accurately guess the value of an unknown $Y$ from a known $X$. If you can, it doesn't matter if that's (in part) because $X$ is correlated with $Z$ which is contributing to the resulting value of $Y$. You wanted to be able to predict $Y$, and you can.
Best Answer
My guess would be that, in econometrics at least, cross-sectional studies are often trying to get at causal relationships: the causal effect of an additional year of education on earnings, for example.
In time series, we are usually only after a prediction for future values of our outcome. Given some set of factors, what do we expect the GDP growth rate to be next quarter? We don't care if the measure of consumer confidence that we include directly causes growth, but are instead content to use its predictive power (as opposed to true explanatory power) to help with our forecast.
So perhaps positive news stories cause consumer confidence, which leads to economic growth. Leaving out a measure of the positivity of news stories would lead to omitted variables bias in that the coefficient on confidence isn't really a measure of the effect of confidence itself. But we are still able to get useful forecasts despite the omitted variable.
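That chain can be sketched in a simulation (hypothetical coefficients; a simple covariance-ratio regression stands in for any forecasting model): news positivity causes both confidence and growth, growth is regressed on confidence alone, and the forecasts remain useful even though the coefficient has no causal interpretation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Hypothetical chain: news positivity -> consumer confidence, and
# news positivity -> growth. Confidence itself does not cause growth.
news = rng.normal(size=n)
confidence = 0.9 * news + rng.normal(scale=0.5, size=n)
growth = 0.5 * news + rng.normal(scale=0.5, size=n)

# Regress growth on confidence alone; the slope is not a causal effect,
# but the fitted values still track growth via the shared news factor.
b = np.cov(confidence, growth)[0, 1] / np.var(confidence)
pred = b * confidence
corr = np.corrcoef(pred, growth)[0, 1]
print(corr)  # well above zero: confidence forecasts growth without causing it
```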