I am trying to tackle a problem which deals with the imputation of missing data from a panel data study(Not sure if I am using 'panel data study' correctly – as I learned it today.) I have total death count data for years 2003 to 2009, all the months, male & female, for 8 different districts and for 4 age groups.
The dataframe looks something like this:
District Gender Year Month AgeGroup TotalDeaths
Northern Male 2006 11 01-4 0
Northern Male 2006 11 05-14 1
Northern Male 2006 11 15+ 83
Northern Male 2006 12 0 3
Northern Male 2006 12 01-4 0
Northern Male 2006 12 05-14 0
Northern Male 2006 12 15+ 106
Southern Female 2003 1 0 6
Southern Female 2003 1 01-4 0
Southern Female 2003 1 05-14 3
Southern Female 2003 1 15+ 136
Southern Female 2003 2 0 6
Southern Female 2003 2 01-4 0
Southern Female 2003 2 05-14 1
Southern Female 2003 2 15+ 111
Southern Female 2003 3 0 2
Southern Female 2003 3 01-4 0
Southern Female 2003 3 05-14 1
Southern Female 2003 3 15+ 141
Southern Female 2003 4 0 4
For the 10 months spread over 2007 and 2008 some of the total deaths from all districts were not recorded. I am trying to estimate these missing value through a multiple imputation method. Either using Generalized Linear Models or SARIMA models.
My biggest issue is the use of software and the coding. I asked a question on Stackoverflow, where I want to extract the data into smaller groups such as this:
District Gender Year Month AgeGroup TotalDeaths
Northern Male 2003 1 01-4 0
Northern Male 2003 2 01-4 1
Northern Male 2003 3 01-4 0
Northern Male 2003 4 01-4 3
Northern Male 2003 5 01-4 4
Northern Male 2003 6 01-4 6
Northern Male 2003 7 01-4 5
Northern Male 2003 8 01-4 0
Northern Male 2003 9 01-4 1
Northern Male 2003 10 01-4 2
Northern Male 2003 11 01-4 0
Northern Male 2003 12 01-4 1
Northern Male 2004 1 01-4 1
Northern Male 2004 2 01-4 0
Going to
Northern Male 2006 11 01-4 0
Northern Male 2006 12 01-4 0
But someone suggested I should rather bring my question here – perhaps ask for a direction? Currently I am unable to enter this data as a proper time-series/panel study into R. My eventual aim is to use this data and the amelia2
package with its functions to impute for missing TotalDeaths
for certain months in 2007 and 2008, where the data is missing.
Any help, how to do this and perhaps suggestions on how to tackle this problem would be gratefully appreciated.
If this helps, I am trying to follow a similar approach to what Clint Roberts did in his PhD Thesis.
EDIT:
After creating the 'time' and 'group' variable as suggested by @Matt:
> head(dat)
District Gender Year Month AgeGroup Unnatural Natural Total time group
1 Khayelitsha Female 2001 1 0 0 6 6 1 Khayelitsha.Female.0
2 Khayelitsha Female 2001 1 01-4 1 3 4 1 Khayelitsha.Female.01-4
3 Khayelitsha Female 2001 1 05-14 0 0 0 1 Khayelitsha.Female.05-14
4 Khayelitsha Female 2001 1 15up 8 73 81 1 Khayelitsha.Female.15up
5 Khayelitsha Female 2001 2 0 2 9 11 2 Khayelitsha.Female.0
6 Khayelitsha Female 2001 2 01-4 0 2 2 2 Khayelitsha.Female.01-4
As you notice, there's actually further detail 'Natural' and 'Unnatural'.
Best Answer
You can use the
Amelia
package to impute the data (full disclosure: I am one of the authors ofAmelia
). The package vignette has an extended example of how to use it to impute missing data.It seems as though you have units which are district-gender-ageGroup observed at the monthly level. First you create a factor variable for each type of unit (that is, one level for each district-gender-ageGroup). Let's call this
group
. Then, you would need a variable for time, which is probably the number of months since January 2003. Thus, this variable would be 13 in January of 2004. Call this variabletime
. Amelia will allow you to impute based on the time trends with the following commands:The
ts
andcs
arguments simply denote the time and unit variables. Thesplinetime
argument sets how flexible should time be used to impute the missing data. Here, a 2 means that the imputation will use a quadratic function of time, but higher values will be more flexible. Theintercs
argument here tells Amelia to use a separate time trend for each district-gender-ageGroup. This adds many parameters to the model, so if you run into trouble, you can set this toFALSE
to try to debug.In any event, this will get you imputations using the time information in your data. Since the missing data is bounded at zero, you can use the
bounds
argument to force imputations into those logical bounds.EDIT: How to create group/time variables
The time variable might be the easiest to create, because you just need to count from 2002 (assuming that is the lowest year in your data):
The group variable is slightly harder but a quick way to do it is using the paste command:
With these variables created, you want to remove the original variables from the imputation. To do that you can use the
idvars
argument: