Handling Lagging in Grouped Time Series Data

Tags: lags, r, regression, time series

I have a few tens of thousands of observations that are in a time series but grouped by locations. For example:

location date     observationA observationB
---------------------------------------
 A       1-2010   22           12
 A       2-2010   26           15
 A       3-2010   45           16
 A       4-2010   46           27
 B       1-2010   167          48
 B       2-2010   134          56
 B       3-2010   201          53
 B       4-2010   207          42

I want to see if month x's observationA has any linear relationship with month x+1's observationB.

I did some research and found a zoo function, but it doesn't appear to have a way to limit the lag by group. So if I used zoo and lagged observationB by 1 row, I'd end up with location A's last observationB as location B's first observationB. I'd rather have the first observationB of each location be NA or some other obvious value to indicate "don't touch this row".

I guess what I'm getting at is whether there's a built-in way of doing this in R? If not, I imagine I can get this done with a standard loop construct. Or do I even need to manipulate the data?

Best Answer

There are several ways to get a lagged variable within a group. First of all, sort the data so that within each group the observations are in time order.
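
As a minimal sketch of that sorting step in base R (the data frame `raw` and its values are made up for illustration and are not part of the original question):

```r
# Hypothetical unsorted panel; names and values are illustrative only
raw <- data.frame(
  location = c("b", "a", "b", "a"),
  time     = c(2, 2, 1, 1),
  var      = c(20, 11, 19, 10)
)

# Order by group first, then by time within each group
sorted <- raw[order(raw$location, raw$time), ]
rownames(sorted) <- NULL  # drop the old row labels
sorted
```

After this, consecutive rows within a location are consecutive time points, which is what a row-shifting lag relies on.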

First let us create a sample data.frame:

> set.seed(13)
> dt <- data.frame(location = rep(letters[1:2], each = 4), time = rep(1:4, 2), var = rnorm(8))
> dt
  location time        var
1        a    1  0.5543269
2        a    2 -0.2802719
3        a    3  1.7751634
4        a    4  0.1873201
5        b    1  1.1425261
6        b    2  0.4155261
7        b    3  1.2295066
8        b    4  0.2366797

Define our lag function:

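    > # head(x, -1) drops the last element; unlike x[1:(length(x)-1)] it is also safe for groups with a single row
    > lg <- function(x) c(NA, head(x, -1))

  1. The lag of the variable within each group can then be calculated using tapply: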

     > unlist(tapply(dt$var, dt$location, lg))
        a1         a2         a3         a4         b1         b2         b3         b4 
        NA  0.5543269 -0.2802719  1.7751634         NA  1.1425261  0.4155261  1.2295066
    
  2. Using ddply from package plyr:

    > library(plyr)
    > ddply(dt, ~location, transform, lvar = lg(var))
      location time        var       lvar
    1        a    1  0.5543269         NA
    2        a    2 -0.2802719  0.5543269
    3        a    3  1.7751634 -0.2802719
    4        a    4  0.1873201  1.7751634
    5        b    1  1.1425261         NA
    6        b    2  0.4155261  1.1425261
    7        b    3  1.2295066  0.4155261
    8        b    4  0.2366797  1.2295066
    
  3. Speedier version using data.table from package data.table:

     > library(data.table)
     > ddt <- data.table(dt)
     > ddt[,lvar := lg(var), by = c("location")]
         location time        var       lvar
    [1,]        a    1  0.5543269         NA
    [2,]        a    2 -0.2802719  0.5543269
    [3,]        a    3  1.7751634 -0.2802719
    [4,]        a    4  0.1873201  1.7751634
    [5,]        b    1  1.1425261         NA
    [6,]        b    2  0.4155261  1.1425261
    [7,]        b    3  1.2295066  0.4155261
    [8,]        b    4  0.2366797  1.2295066
    
  4. Using the lag function from package plm:

     > library(plm)
     > pdt <- pdata.frame(dt)
     > lag(pdt$var)
       a-1        a-2        a-3        a-4        b-1        b-2        b-3        b-4 
        NA  0.5543269 -0.2802719  1.7751634         NA  1.1425261  0.4155261  1.2295066
    
  5. Using the lag function from package dplyr:

    > library(dplyr)
    > dt %>% group_by(location) %>% mutate(lvar = lag(var))
    Source: local data frame [8 x 4]
    Groups: location        
      location time        var       lvar
    1        a    1  0.5543269         NA
    2        a    2 -0.2802719  0.5543269
    3        a    3  1.7751634 -0.2802719
    4        a    4  0.1873201  1.7751634
    5        b    1  1.1425261         NA
    6        b    2  0.4155261  1.1425261
    7        b    3  1.2295066  0.4155261
    8        b    4  0.2366797  1.2295066
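
If you would rather stay in base R and keep the lagged values aligned with the original rows, `ave` can apply the same kind of lag function within each group. The following is a sketch under that assumption; `lag1` is a helper name introduced here for illustration, and the sample data is rebuilt so that the block runs on its own:

```r
# Rebuild the sample data so the block is self-contained
set.seed(13)
dt <- data.frame(location = rep(letters[1:2], each = 4),
                 time = rep(1:4, 2), var = rnorm(8))

# Illustrative helper: shift a vector down by one, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

# ave() applies lag1 within each location and returns the values
# in the original row order, so they can be assigned as a new column
dt$lvar <- ave(dt$var, dt$location, FUN = lag1)
dt
```

Like the tapply version, this assumes the rows are already sorted by time within each location.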
    

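The last two approaches require conversion from data.frame to another object. With plm, the panel index takes care of the ordering, so you do not need to worry about sorting; with dplyr, the lag follows the row order within each group, so sort first (for example with arrange(location, time)) if the rows are not already in time order. My personal preference is the dplyr version, which was not available when I first wrote this answer.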
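
To connect this back to the question, here is a sketch of how the lagged column could feed a regression of month x+1's observationB on month x's observationA. It uses the eight rows from the question; the column name `month`, the helper `lag1`, and the choice of a plain `lm` fit are assumptions made for illustration, not something the question specifies:

```r
# Data from the question, with the date recoded as a month number
obs <- data.frame(
  location     = rep(c("A", "B"), each = 4),
  month        = rep(1:4, times = 2),
  observationA = c(22, 26, 45, 46, 167, 134, 201, 207),
  observationB = c(12, 15, 16, 27, 48, 56, 53, 42)
)

# Sort by location, then by month within each location
obs <- obs[order(obs$location, obs$month), ]

# Illustrative helper: shift a vector down by one, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

# Previous month's observationA within each location;
# the first row of every location gets NA
obs$lagA <- ave(obs$observationA, obs$location, FUN = lag1)

# lm() drops the NA rows by default, so each location's first month is excluded
fit <- lm(observationB ~ lagA, data = obs)
summary(fit)
```

With real data spanning many locations, you may also want to account for differences between locations (for example by adding location as a factor), since a pooled fit like this one mixes within-location and between-location variation.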

Update: Changed the data.table code to reflect the developments of the data.table package, pointed out by @Hibernating.

Update 2: Added dplyr example.
