Solved – Validity of pseudo-panel data constructed from repeated cross sectional data as a panel data

cross-section, panel data, validation

I am looking at data from the Federal Reserve, which contains both panel data and repeated cross-sectional data at different time points: e.g. 2007-2009 is a panel, while 2010 is a cross-sectional data set, and everything before that is repeated cross-sections as well until you get back to the 1983-1989 period, which is also a panel. I want to use recent data sets such as 2001 – 2009, of which only the last two years will be true panel data.

Repeated cross-sectional (RCS) data is generally considered inferior to true panel data because the same individuals are not followed over time, so individual histories cannot be included in a model. However, several authors, such as Deaton (1985) and Moffitt (1990, 1993), showed that RCS data can be used to estimate a few commonly used models, such as the fixed effects model or the linear dynamic model. These methods group “similar” individuals into cohorts and treat the cohort averages as observations from a pseudo-panel. Note that all the prior studies were conducted on repeated cross-sections without the panel part.
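The cohort-averaging step can be sketched in base R as follows. This is only an illustration on simulated data: the data frame `rcs`, its columns (`year`, `birth_year`, `income`, `consumption`), the survey years, and the choice of five-year birth cohorts are all assumptions made for the example, not something taken from the papers cited above.

```r
# Simulated repeated cross-sections: a fresh sample of individuals each year.
set.seed(1)
years <- c(2001, 2004, 2007)
rcs <- do.call(rbind, lapply(years, function(y) {
  n <- 500
  birth_year <- sample(1940:1979, n, replace = TRUE)
  income <- 30 + 0.5 * (y - birth_year) + rnorm(n, sd = 5)
  data.frame(year = y, birth_year = birth_year, income = income,
             consumption = 5 + 0.6 * income + rnorm(n, sd = 3))
}))

# Group "similar" individuals: here, five-year birth cohorts (an assumption).
rcs$cohort <- cut(rcs$birth_year, breaks = seq(1940, 1980, by = 5),
                  right = FALSE, dig.lab = 4)

# Cohort-by-year means act as the observations of the pseudo-panel.
pseudo <- aggregate(cbind(income, consumption) ~ cohort + year,
                    data = rcs, FUN = mean)
# Cell sizes, computed with the same grouping so the row order matches.
pseudo$n_cell <- aggregate(income ~ cohort + year, data = rcs,
                           FUN = length)$income

# Cohort fixed effects regression on the cohort means, weighted by cell size.
fit <- lm(consumption ~ income + cohort, data = pseudo, weights = n_cell)
coef(fit)["income"]
```

Each row of `pseudo` is one cohort in one survey year, so the cohort plays the role that the individual plays in a genuine panel. Small cells make the cohort means noisy, which is why the cell size is kept alongside the means.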

Now, my first question is: is there any known method to compare pseudo-panel data to genuine panel data? My idea is to fit a model to the synthetic panel data and estimate the parameters, then fit the same model to the genuine panel data and compare the estimation accuracy. Does this sound correct? Of course, I would also like ideas about how much of it is doable. (Please note that I have limited experience managing huge data sets like the ones available on the Federal Reserve website.)

Best Answer

I do not know whether there are established methods to compare panel data to repeated cross-sectional data. But I want to add that true panel data is not always superior to repeated cross-sectional data. Attrition or learning effects, for example, may be a problem in panel data but not in repeated cross-sectional data, although I do not know whether these problems are present in your case. If they are, the second and third years (and so on) of your panel data may be problematic compared to repeated cross-sectional data. You should keep this in mind.

In general I think what you want to do sounds doable and it could reveal new information in comparison with the analysis of cross-sectional data only (although I do not know your research question).
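One rough way to put the two sets of estimates side by side is a z-statistic for the difference between the same coefficient in the two fitted models. The sketch below is not an established pseudo-panel method: `compare_coef` is a hypothetical helper, the data are simulated, and the calculation assumes the two samples are independent, which would not hold if the pseudo-panel and the genuine panel share respondents.

```r
# Hypothetical helper: z-statistic for the difference between the same
# coefficient in two fitted models, assuming independent samples.
compare_coef <- function(fit_a, fit_b, term) {
  a <- summary(fit_a)$coefficients[term, c("Estimate", "Std. Error")]
  b <- summary(fit_b)$coefficients[term, c("Estimate", "Std. Error")]
  diff <- unname(a["Estimate"] - b["Estimate"])
  se   <- unname(sqrt(a["Std. Error"]^2 + b["Std. Error"]^2))
  c(difference = diff, se = se, z = diff / se)
}

# Toy illustration: two independent simulated samples sharing a slope of 0.6.
set.seed(2)
d1 <- data.frame(x = rnorm(200)); d1$y <- 0.6 * d1$x + rnorm(200)
d2 <- data.frame(x = rnorm(50));  d2$y <- 0.6 * d2$x + rnorm(50)
res <- compare_coef(lm(y ~ x, data = d1), lm(y ~ x, data = d2), "x")
res
```

A large absolute `z` would suggest the two data sets imply different parameter values, at which point the advantages and disadvantages of each data type become the natural place to look for an explanation.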

If the estimates differ between the two analyses, I would look for the reasons by considering the advantages and disadvantages of both types of datasets. There are several papers on this topic which might help you, such as

Deaton (1985)

Verbeek & Nijman (1992)

Frees (2004)

Lee & Niemeier (1996)

Hsiao (2007)

Related Solutions

Solved – How to cope with serial correlation and time effects in a panel data model in R

Your question is not very clear, and the link to the data is no longer working...

For the time fixed effects, your call should look like this:

library(plm)

# Within estimator with time fixed effects only
fixed <- plm(Price ~ Income + Housing_units + Population_age + 
   Population_density + Unemployment + Real_mortgage_rate + Expected_GDP_growth,
   data=df, index=c("Id", "Year"), model="within", effect="time")

If you want both individual and time FEs you can also use effect="twoways".

To deal with serial correlation you can use vcovHC.plm(), which by default computes SEs clustered by group, i.e. robust to heteroscedasticity and arbitrary correlation within the clusters. See Chapter 14.4 of Using R for Introductory Econometrics (Heiss, 2016). (You can also read it online.) Obtaining robust SEs is easy:

require(lmtest)
coeftest(fixed, vcov. = vcovHC)

All of this is discussed in the plm vignette:

http://cran.at.r-project.org/web/packages/plm/vignettes/plm.pdf
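Since the original data is unavailable, here is a self-contained variant of the same calls on the `Grunfeld` data set that ships with plm. The choice of data set and of the regressors `value` and `capital` is only for illustration; the `method` and `cluster` arguments are written out to make the clustering explicit rather than relying on defaults.

```r
library(plm)
library(lmtest)

# Grunfeld investment data shipped with plm: 10 firms observed over 20 years.
data("Grunfeld", package = "plm")

# Individual and time fixed effects together.
fe2 <- plm(inv ~ value + capital, data = Grunfeld,
           index = c("firm", "year"), model = "within", effect = "twoways")

# Arellano-type covariance matrix, clustered by firm (the panel's group index).
rob <- coeftest(fe2, vcov. = vcovHC(fe2, method = "arellano", cluster = "group"))
rob
```

Comparing `rob` with `summary(fe2)` shows how much the standard errors change once within-firm correlation is allowed for; the coefficient estimates themselves are unchanged.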

Solved – Panel data: difference between time effects and cross-sectional dependence

I'm just thinking out loud here,

Suppose you have industry-county-year level data, your outcome is $Y_{ict}$, and you are interested in the effect of some variable $X_{ict}$.

In your strategy you would correctly think you can use:

(1) industry-county (panel) fixed effects to control for time invariant confounding factors across these panels as well as the average difference in time varying covariates across industry-county pairs

(2) year fixed effects to control for shocks that are common to all industries and counties in a given year

However, what if there are shocks that are common across the counties within a region, indexed by $r$, yet are both time varying and different across regions?

That is, perhaps the true data generating process is

$Y_{ict}=\underbrace{\theta_{ic}}_\text{panel fixed effect}+\underbrace{\theta_t}_\text{year fixed effect}+\underbrace{\theta_{rt}}_\text{regional shocks}+\underbrace{\beta}_\text{parameter of interest} X_{ict}+\underbrace{\epsilon_{ict}}_\text{idiosyncratic shock}$

But you estimate a model

$Y_{ict}=\theta_{ic}+\theta_t+\beta X_{ict}+\epsilon_{ict}$

which does not attempt to proxy for this regional shock, then, to the degree that $Cov(\theta_{rt},X_{ict})\neq 0$, I believe your estimate $\hat{\beta}$ would in part reflect the variation in $\theta_{rt}$ that covaries with $X_{ict}$.

That is,

$plim \; \hat{\beta} =\underbrace{ \beta}_\text{true parameter} + \underbrace{\frac{Cov(X_{ict},\theta_{rt})}{Var(X_{ict})}}_\text{bias}$

To solve this, I believe you could

(1) Cluster your standard errors at the geographical level where you think there may be correlated disturbances

and

(2) Find an instrument $Z_{ict}$ for $X_{ict}$ that is strongly correlated with $X_{ict}$ (relevance) and that affects the outcome only through its effect on $X_{ict}$, and not through $\theta_{rt}$ influencing $Z_{ict}$ or through $Z_{ict}$ influencing $Y_{ict}$ directly (excludability).
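The omitted regional shock can be illustrated with a small base R simulation. Everything here is an assumption made for the example: the sample sizes, the single county index (standing in for the industry-county panel), the true $\beta = 1$, and the strength of the link between $X_{ict}$ and $\theta_{rt}$. The second regression adds region-by-year dummies as a direct proxy for $\theta_{rt}$, which is an illustrative addition rather than one of the two remedies listed above. Strictly, the covariance and variance that drive the bias are those left over after the fixed effects have been partialled out.

```r
set.seed(3)
n_county <- 40; n_year <- 10
d <- expand.grid(county = 1:n_county, year = 1:n_year)
d$region <- (d$county - 1) %/% 10 + 1           # 4 regions of 10 counties

# Regional shock theta_rt: one draw per (region, year) cell.
shock <- matrix(rnorm(4 * n_year), nrow = 4)
d$theta_rt <- shock[cbind(d$region, d$year)]

theta_c <- rnorm(n_county)[d$county]             # panel fixed effect
theta_t <- rnorm(n_year)[d$year]                 # year fixed effect

# X covaries with the regional shock; the true beta is 1.
d$x <- theta_c + theta_t + 0.8 * d$theta_rt + rnorm(nrow(d))
d$y <- theta_c + theta_t + d$theta_rt + 1 * d$x + rnorm(nrow(d))

# Panel and year fixed effects only: theta_rt is left in the error term.
b_twfe <- coef(lm(y ~ x + factor(county) + factor(year), data = d))["x"]

# Region-by-year dummies absorb theta_rt.
b_full <- coef(lm(y ~ x + factor(county) + factor(region):factor(year),
                  data = d))["x"]

c(twfe = unname(b_twfe), with_region_year = unname(b_full))
```

Because $X_{ict}$ is built to covary positively with $\theta_{rt}$, the first estimate should sit above the true value, while the second should be close to it.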