I have a few tens of thousands of observations that are in a time series but grouped by locations. For example:
location date observationA observationB
---------------------------------------
A 1-2010 22 12
A 2-2010 26 15
A 3-2010 45 16
A 4-2010 46 27
B 1-2010 167 48
B 2-2010 134 56
B 3-2010 201 53
B 4-2010 207 42
I want to see if month x's observationA
has any linear relationship with month x+1's observationB
.
I did some research and found a zoo
function, but it doesn't appear to have a way to limit the lag by group. So if I used zoo and lagged observationB
by 1 row, I'd end up with the location A's last observationB
as location B's first observationB
. I'd rather have the first observationB
of any location to be NA
or some other obvious value to indicate "don't touch this row".
I guess what I'm getting at is whether there's a built-in way of doing this in R? If not, I imagine I can get this done with a standard loop construct. Or do I even need to manipulate the data?
Best Answer
There are several ways how you can get a lagged variable within a group. First of all you should sort the data, so that in each group the time is sorted accordingly.
First let us create a sample data.frame:
Define our lag function:
Then the lag of variable within group can be calculated using
tapply
:Using
ddply
from package plyr:Speedier version using
data.table
from package data.tableUsing
lag
function from package plmUsing
lag
function from package dplyrLast two approaches require conversion from
data.frame
to another object, although then you do not need to worry about sorting. My personal preference is the last one, which was not available when writing the answer initially.Update: Changed the data.table code to reflect the developments of the data.table package, pointed out by @Hibernating.
Update 2: Added dplyr example.