Having an unbalanced panel is not a problem nowadays. In the past, when econometric computations were done by hand, inverting matrices for unbalanced panels was cumbersome, but for computers this is no issue. The only remaining concern is why the panel is unbalanced: is it due to attrition? If so, is the attrition random or related to characteristics of the statistical units? For instance, in surveys people with higher education tend to be more responsive and therefore stay in the panel longer.
Regarding the fixed effects model, have you checked whether the variables that are time-invariant in theory actually do not vary over time? Sometimes coding errors sneak in, and then all of a sudden a variable varies over time when it shouldn't. One way of checking this is to use the xtsum
command, which displays overall, between, and within summary statistics. The time-invariant variables should have a within standard deviation of zero; if they don't, something went wrong in the coding.
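The same check can be done outside Stata. Here is a minimal sketch in pandas (my own toy data, not from the question) that reproduces the "within" part of xtsum: demean each variable by unit and look at the standard deviation of the deviations.

```python
# Sketch (hypothetical data): replicate xtsum's "within" check in pandas.
import pandas as pd

# Toy panel: 'educ' should be time-invariant, 'wage' varies over time.
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "educ": [12, 12, 12, 16, 16, 16],           # time-invariant by design
    "wage": [10.0, 11.0, 12.5, 20.0, 19.0, 21.0],
})

# Within variation = deviation of each observation from its unit mean.
within = df.groupby("id")[["educ", "wage"]].transform(lambda x: x - x.mean())
within_sd = within.std()

# A truly time-invariant variable has zero within standard deviation.
print(within_sd["educ"] == 0)  # → True
print(within_sd["wage"] > 0)   # → True
```

If a supposedly fixed variable shows a positive within standard deviation here, that flags exactly the kind of coding error described above.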
A negative Hausman test statistic is a bad sign because the matrices the test is built on are positive semi-definite, so the theoretical value of the statistic is non-negative. Negative values point towards model misspecification or too small a sample (related to this is this question).
If you cluster your standard errors, you also need a modified version of the Hausman test. This is implemented in the xtoverid
command, which you can use like this:
xtreg ln_r_prisperkg_Frst_102202 Dflere_mottak_tur i.landingsfylkekode i.kvartiler_ny markedsk_torsk gjenv_TAC_NØtorsk_år_prct lalder_fartøy i.fangstr r_minst_Frst_torsk gjenv_kvote_NØtorsk_fartøy_prct i.lengde_gruppering mobilitet, fe vce(cluster fartyid)
xtoverid
Rejecting the null rejects the validity of the assumptions underlying the random effects model.
The xtset
command only takes the unit id into account for fixed effects estimation. The time variable does not eliminate time fixed effects. So
xtset id time
xtreg y x, fe
gives you the exact same results as
xtset id
xtreg y x, fe
The time variable only matters for commands for which the sorting order of the data is relevant; for instance, xtserial
, which tests for serial correlation in panel data, requires it. This has been discussed here. So if you want to include time fixed effects, you need to include the day dummies separately via i.day
, for example. In this context, the season and year dummies make sense, so it's good that you use them.
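The reason the time variable is irrelevant for the fixed effects estimate itself is that the within estimator only uses unit means, which do not depend on any time ordering. A small Python stand-in for the Stata point (my own simulated data):

```python
# Illustration (simulated data): the within/FE estimator only uses unit
# means, so reordering observations within units — i.e. any time labeling —
# leaves the estimate unchanged. Time enters only via explicit dummies.
import numpy as np

rng = np.random.default_rng(3)
unit = np.repeat(np.arange(200), 5)       # 200 units, 5 periods each
x = rng.normal(size=unit.size)
y = 1.5 * x + rng.normal(size=unit.size)  # true slope is 1.5

def within_slope(u, xv, yv):
    # Demean by unit, then run OLS on the deviations.
    xm = xv - (np.bincount(u, weights=xv) / np.bincount(u))[u]
    ym = yv - (np.bincount(u, weights=yv) / np.bincount(u))[u]
    return (xm @ ym) / (xm @ xm)

b1 = within_slope(unit, x, y)

# Permute the rows, scrambling any notion of time order.
perm = rng.permutation(unit.size)
b2 = within_slope(unit[perm], x[perm], y[perm])

print(np.isclose(b1, b2))  # → True
```

Only the unit identifier matters for the demeaning, which is why xtset id and xtset id time give identical fe results.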
Below are my responses to your two questions.
Don't these models suffer from omitted variable bias?
- All regression-based models, including time series models, suffer from
omitted variable bias unless the regression is based on randomized
experimental data.
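To make this concrete, here is a small simulation of my own (not from the question) showing omitted variable bias: when a confounder z is left out, the short regression's slope absorbs part of z's effect.

```python
# Toy simulation: omitted-variable bias in a simple regression.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # x depends on z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of x is 1.0

# Short regression y ~ x omits z: the slope picks up part of z's effect.
b_short = np.polyfit(x, y, 1)[0]             # noticeably above 1.0

# Long regression y ~ x + z recovers the true coefficient.
X = np.column_stack([x, z, np.ones(n)])
b_long = np.linalg.lstsq(X, y, rcond=None)[0][0]  # close to 1.0
```

Randomization breaks the x–z link (x no longer depends on z), which is why experimental data escape this bias.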
What allows time series studies to use only one independent variable, as compared to cross-sectional and panel studies that rarely use fewer than two independent variables? Do time series models have a property that allows researchers to use just one independent variable?
- You can use one independent variable in a time series regression just
like in any other regression. There is nothing special about time series
that restricts you to a single independent variable; you can use as many or as few independent variables as you like.
With regards to your specific question on number of independent variables, according to this wonderful article:
"And do not try to estimate relationships for more than three
variables in a regression (findings from Goldstein and Gigerenzer,
2009, are consistent with this rule-of-thumb)"
The same article also provides a real-world example of omitted variable bias. Bottom line: use domain knowledge, the available literature, experimental evidence, and experts to select your variables.
In addition, I would use a Transfer Function within the ARIMA framework, which is a general form of ARIMA and incorporates AR/ARMA, ARDL, and other time series regressions.
Best Answer
I find your question highly interesting, since I myself have had the same doubts. Here are some of my thoughts...
In a panel study, fixed effects control for every variable that is constant over time, e.g., sex, and for the stable part (the mean) of every variable that changes. That leaves the part of the variable that changes over time as an independent variable if the variable is in the model, or in the error term (the idiosyncratic error) if the variable is unobserved. This is thought to reduce bias in the model.
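The mechanics of this can be sketched in a short simulation (my own toy setup): the fixed effects ("within") estimator is OLS on unit-demeaned data, so anything constant within a unit, observed or not, drops out.

```python
# Toy simulation: the within transformation removes a unit-level
# confounder that pooled OLS cannot handle.
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods = 500, 4
unit = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(size=n_units)[unit]            # unobserved unit effect
x = 0.5 * alpha + rng.normal(size=unit.size)      # x correlated with alpha
y = 2.0 * x + alpha + rng.normal(size=unit.size)  # true slope is 2.0

def demean(v):
    # Subtract each unit's mean (the "within" transformation).
    means = np.bincount(unit, weights=v) / np.bincount(unit)
    return v - means[unit]

b_pooled = np.polyfit(x, y, 1)[0]                  # biased: alpha omitted
b_within = np.polyfit(demean(x), demean(y), 1)[0]  # close to 2.0
```

Because alpha is constant within each unit, demeaning wipes it out entirely, which is the sense in which fixed effects "control for" every time-constant variable at once.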
Causal graph theory is a modern school of causality, with established rules for when a variable should be controlled for (see Elwert & Winship, 2014, Endogenous Selection Bias). Let's say we are interested in whether variable X causes variable Y. If a variable Z causes both X and Y, then Z will bias the relationship X -> Y; this is solved by conditioning on Z in our regression. Now let's say we have the same variables, but Z is not the cause of both X and Y but instead caused by them. The correlations are exactly the same as above; it is just the causal arrows that are reversed. In this case Z should be left out of the equation: including such a Z (a collider) invites bias into the model. Finally, what if Z is caused by X and in turn causes Y? In this case part of the total effect that X has on Y is indirect, through Z. If we control for Z by including it in our model, we estimate only the direct effect of X on Y. If we wanted to know the total effect of X on Y, then we have invited bias by including Z in the equation. This is called overcontrol bias.
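The collider case is easy to demonstrate with a toy simulation of my own (not from the cited paper): conditioning on a Z that is caused by both X and Y distorts an otherwise clean estimate.

```python
# Toy simulation: collider bias from "controlling for" a common effect.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
y = 1.0 * x + rng.normal(size=n)   # true effect of x on y is 1.0
z = x + y + rng.normal(size=n)     # collider: caused by both x and y

# Correct model: leave z out.
b_ok = np.polyfit(x, y, 1)[0]      # close to 1.0

# Wrong model: conditioning on the collider distorts the estimate.
X = np.column_stack([x, z, np.ones(n)])
b_bad = np.linalg.lstsq(X, y, rcond=None)[0][0]  # far from 1.0
```

Here the estimate with z included collapses towards zero even though X genuinely causes Y, which is exactly the endogenous selection bias the rules above warn about.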
Fixed effects is, by definition, the econometric equivalent of a nuclear blast. You clean your model of the mean value of every conceivable variable. This is the source of the method's strength... and its horrible weakness. The model is normally used as a safety precaution when the researcher does not know whether there are any unmeasured confounders. By doing so, however, more bias might be invited into the model than is removed. Furthermore, the stable part of any indirect effect between an independent and dependent variable will be lost, which the researcher should at least acknowledge.
I recommend that you only use fixed effects when you are confident that the unobserved variables cause both the independent and dependent variables, and that no variables are caused by both of them. To make this assessment you need to know which variables are part of the causal system, which is why you should draw your own causal graph before you choose your model. In some cases you might be better off using some other method; in others, using standard OLS regression and ignoring the potential threat of an unmeasured variable nobody can name anyway.