Pooled Time Series Regression in R – Techniques and Applications

Tags: pooling, r, regression, time series

I've got data on mail volume sent by household for seven age groups, with 12 years of data for each age group. I originally ran a simple regression on each age group individually and realized I needed to dig deeper. My aim now is to pool the data (giving me 84 observations) and try to identify some period effects (or year effects, whichever you prefer). My pooled data are currently organized like this (PPHPY stands for Pieces per Household Per Year):

Age Group    Year   PPHPY
1            2001   127.62
1            2002   144.47
1            2003   111.70
1            2004   95.96
1            2005   96.46
1            2006   139.91
1            2007   85.52
1            2008   75.43
1            2009   109.34
1            2010   53.16
1            2011   64.09
1            2012   50.94        
2            2001   176.48
2            2002   172.86
2            2003   137.79
.              .      .
.              .      .
.              .      .
7            2012   163.39

I first regressed PPHPY on year and year dummies (suppressing the intercept to avoid perfect multicollinearity). This gave me period effects for the aggregated data (i.e., something like a period effect across all age groups, I think). This looked like the following:

> ## Generate YearDummy using factor()
>
> YearDummy <- factor(YearVar)
>
> ## Check to see that YearDummy is indeed a factor variable
>
> is.factor(YearDummy)
[1] TRUE
>
> ## (... + 0) leaves the intercept out, so the first year dummy (YearDummy2001) remains in.
> ## One or the other must be dropped to avoid perfect multicollinearity.
>
> LSDVYear <- lm(PPHPY ~ YearVar + YearDummy + 0, data=maildatapooled)
> summary(LSDVYear)
Call:
lm(formula = PPHPY ~ YearVar + YearDummy + 0, data = maildatapooled)

Residuals:
    Min      1Q  Median      3Q     Max
-99.658 -39.038   8.814  43.670  82.300

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)
YearVar       5.743e-02  9.851e-03   5.830 1.45e-07 ***
YearDummy2001 1.099e+02  2.795e+01   3.930 0.000193 ***
YearDummy2002 1.209e+02  2.796e+01   4.324 4.85e-05 ***
YearDummy2003 7.791e+01  2.797e+01   2.786 0.006819 **
YearDummy2004 8.053e+01  2.797e+01   2.879 0.005251 **
YearDummy2005 6.887e+01  2.798e+01   2.461 0.016236 *
YearDummy2006 6.572e+01  2.799e+01   2.348 0.021618 *
YearDummy2007 5.975e+01  2.799e+01   2.134 0.036210 *
YearDummy2008 5.836e+01  2.800e+01   2.084 0.040696 *
YearDummy2009 4.119e+01  2.801e+01   1.471 0.145745
YearDummy2010 3.056e+01  2.801e+01   1.091 0.278990
YearDummy2011 1.472e+01  2.802e+01   0.525 0.600951
YearDummy2012        NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 52.44 on 72 degrees of freedom
Multiple R-squared:  0.9316,	Adjusted R-squared:  0.9202
F-statistic: 81.71 on 12 and 72 DF,  p-value: < 2.2e-16
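
The "1 not defined because of singularities" note appears because the numeric YearVar is an exact linear combination of the twelve year dummies, so lm() has to drop one term. The following is a minimal sketch of that point, not a reproduction of the output above: the data are simulated (the full dataset is not shown), and the column name AgeGroup and the object name LSDVYearOnly are assumptions made for illustration. With only the dummies and no intercept, every year coefficient is estimable and equals the mean PPHPY across age groups in that year.

```r
# Simulated stand-in for maildatapooled: 7 age groups x 12 years = 84 rows.
# The PPHPY values are made up for illustration; they are NOT the poster's data.
set.seed(1)
maildatapooled <- expand.grid(YearVar = 2001:2012, AgeGroup = 1:7)
maildatapooled$PPHPY <- 100 + 20 * maildatapooled$AgeGroup -
  5 * (maildatapooled$YearVar - 2001) +
  rnorm(nrow(maildatapooled), sd = 10)
maildatapooled$YearDummy <- factor(maildatapooled$YearVar)

# Same specification as in the question: YearVar is collinear with the
# dummies, so one coefficient comes back NA.
LSDVYear <- lm(PPHPY ~ YearVar + YearDummy + 0, data = maildatapooled)
sum(is.na(coef(LSDVYear)))

# Dummies only, no intercept: all 12 year coefficients are estimable.
LSDVYearOnly <- lm(PPHPY ~ 0 + YearDummy, data = maildatapooled)

# Each coefficient is the average PPHPY over the age groups in that year.
year_means <- tapply(maildatapooled$PPHPY, maildatapooled$YearDummy, mean)
cbind(coef = coef(LSDVYearOnly), mean = year_means)
```

Read this way, the dummy coefficients are pooled year averages, which is why they describe a period pattern common to all age groups rather than one per group.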

What I want, however, is to tease out period effects for each age group individually. This is what I'm not sure how to set up. I was hoping someone might help me devise some code in R that would kick out those period effects for each of the seven age groups using the pooled data, as well as help me understand the problem conceptually.

EDIT: I forgot to mention that I see I must include an interaction term involving the time dummies to allow the coefficients to vary across age groups. I'm just having difficulty constructing the proper interaction term and resulting regression equation.
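
One caveat when building that interaction: with a single observation per age group and year, interacting the age-group factor with the full set of year dummies uses 7 x 12 = 84 parameters for 84 observations, so nothing is left to estimate residual variance. The sketch below, on simulated data, contrasts that saturated model with two less demanding alternatives. The column names AgeGroup and YearsSince2001 and the model object names are assumptions for illustration, and the specifications are examples rather than a recommendation for the actual data.

```r
# Simulated stand-in for maildatapooled (values are made up for illustration).
set.seed(2)
maildatapooled <- expand.grid(YearVar = 2001:2012, AgeGroup = factor(1:7))
maildatapooled$YearsSince2001 <- maildatapooled$YearVar - 2001
maildatapooled$PPHPY <- 100 + 20 * as.integer(maildatapooled$AgeGroup) -
  5 * maildatapooled$YearsSince2001 +
  rnorm(nrow(maildatapooled), sd = 10)
maildatapooled$YearDummy <- factor(maildatapooled$YearVar)

# (a) Additive model: one intercept shift per age group plus year effects
#     that are shared by all age groups.
additive <- lm(PPHPY ~ AgeGroup + YearDummy, data = maildatapooled)

# (b) Group-specific linear trends: each age group gets its own intercept
#     and its own slope in time (7 + 7 = 14 coefficients).
trends <- lm(PPHPY ~ 0 + AgeGroup + AgeGroup:YearsSince2001,
             data = maildatapooled)

# (c) Full age-group-by-year interaction: one parameter per cell,
#     so the fit is exact and no residual degrees of freedom remain.
saturated <- lm(PPHPY ~ AgeGroup * YearDummy, data = maildatapooled)

c(additive  = df.residual(additive),
  trends    = df.residual(trends),
  saturated = df.residual(saturated))
```

Model (b) is one way to let the time pattern vary by age group while keeping degrees of freedom; whether a linear trend per group is adequate for the real series is a separate question.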

EDIT 2: I came up with two models and ran them. I felt like the question had evolved at this point and might merit a new post, which can be found here.

Best Answer

This doesn't exactly answer what you are looking for. However, I think you need to consider these issues before doing the analysis, so I focus on the conceptual part (part a). (I had meant to post this as a comment, but it is too long.)

With repeated cross-section data, what you have here is pseudo-panel data (with age group, i.e. age cohort, acting as the individual effect and year as the time effect), not true panel data. The theory behind this has been discussed here. The good news is that you can use the fixed-effects, random-effects, and pooled estimators just as you would for true panel data. The tests for choosing one estimator over another are discussed in the plm package manual. If you are a beginner in panel data, though, it is good to start with Introductory Econometrics by Wooldridge. That being said, consider this Oxford discussion paper as a starting point. Once you understand the concepts in the paper, they won't be difficult to apply in R. Your question doesn't mention any other variables. Is that the case?
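
To make the pooled versus fixed-effects choice concrete, here is a minimal sketch using only base R on simulated data. It is an illustration of the idea, not the plm workflow described in the package manual: the pooled model uses a single intercept, the fixed-effects (least-squares dummy variable) model adds one intercept per age group, and anova() gives the F test of whether the group effects are jointly zero. The column name AgeGroup and the object names are assumptions.

```r
# Simulated stand-in for maildatapooled (values are made up for illustration).
set.seed(3)
maildatapooled <- expand.grid(YearVar = 2001:2012, AgeGroup = factor(1:7))
maildatapooled$PPHPY <- 100 + 20 * as.integer(maildatapooled$AgeGroup) -
  5 * (maildatapooled$YearVar - 2001) +
  rnorm(nrow(maildatapooled), sd = 10)

# Pooled estimator: one intercept and one time slope for all age groups.
pooled <- lm(PPHPY ~ YearVar, data = maildatapooled)

# Fixed-effects estimator (LSDV form): a separate intercept per age group,
# with the time slope still common to all groups.
fixed <- lm(PPHPY ~ YearVar + AgeGroup, data = maildatapooled)

# F test of the nested models: are the 6 extra age-group intercepts
# jointly zero? A small p-value favors the fixed-effects specification.
ftest <- anova(pooled, fixed)
ftest
```

A random-effects fit and a Hausman-type comparison need a dedicated package such as plm or lme4, so they are left out of this dependency-free sketch.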