You can use the Amelia package to impute the data (full disclosure: I am one of the authors of Amelia). The package vignette has an extended example of how to use it to impute missing data.
It seems as though your units are district-gender-ageGroup combinations observed at the monthly level. First, create a factor variable with one level for each unit (that is, one level for each district-gender-ageGroup). Let's call this group. Then you need a variable for time, which is probably a month counter that equals 1 in January 2003 and therefore 13 in January 2004. Call this variable time. Amelia will allow you to impute based on the time trends with the following commands:
library(Amelia)
a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2, intercs = TRUE)
The ts and cs arguments simply denote the time and unit variables. The splinetime argument controls how flexibly time is used to impute the missing data. Here, a 2 means that the imputation will use a quadratic function of time; higher values are more flexible. The intercs argument tells Amelia to use a separate time trend for each district-gender-ageGroup. This adds many parameters to the model, so if you run into trouble, you can set it to FALSE to try to debug.
In any event, this will get you imputations using the time information in your data. Since the missing data are bounded below at zero, you can use the bounds argument to force imputations into those logical bounds.
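As a rough sketch of what the bounds argument expects: it takes a three-column matrix with one row per bounded variable, holding the column number, the lower bound, and the upper bound. The column name Count, the toy values, and the ceiling of 1000 below are hypothetical placeholders, not taken from your data.

```r
# Toy stand-in for my.data; "Count" is a hypothetical name for the
# variable that contains the missing values.
my.data <- data.frame(
  time  = rep(1:3, times = 2),
  group = factor(rep(c("a", "b"), each = 3)),
  Count = c(3, NA, 0, 5, NA, 2)
)

# Each row of the bounds matrix is c(column number, lower bound, upper bound).
count.col <- which(names(my.data) == "Count")
upper     <- 1000  # hypothetical ceiling; choose one that suits your data
bds       <- matrix(c(count.col, 0, upper), nrow = 1, ncol = 3)

# The matrix is then passed along with the other arguments (not run here):
# a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2,
#                 intercs = TRUE, bounds = bds)
```

Looking the column number up by name avoids hard-coding a position that changes when columns are added or dropped.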
EDIT: How to create group/time variables
The time variable might be the easiest to create, because you just need to count from 2002 (assuming that is the lowest year in your data):
my.data$time <- my.data$Month + 12 * (my.data$Year - 2002)
The group variable is slightly harder, but a quick way to create it is with the paste() function:
my.data$group <- with(my.data,
                      as.factor(paste(District, Gender, AgeGroup, sep = ".")))
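To see what these two variables look like, here is a small worked example on made-up rows. Only the column names come from the question; the values are invented, and the final duplicate check is an extra sanity step I am adding, not something Amelia is known to demand.

```r
# Toy data using the column names from the question; the values are made up.
my.data <- data.frame(
  District = c("North", "North", "South"),
  Gender   = c("F", "M", "F"),
  AgeGroup = c("18-24", "18-24", "25-34"),
  Month    = c(1, 12, 1),
  Year     = c(2002, 2002, 2004)
)

# Month counter: 1 in January 2002, 12 in December 2002, 25 in January 2004.
my.data$time <- my.data$Month + 12 * (my.data$Year - 2002)

# One factor level per District-Gender-AgeGroup combination.
my.data$group <- with(my.data,
                      as.factor(paste(District, Gender, AgeGroup, sep = ".")))

print(my.data[, c("time", "group")])

# Sanity check: each unit should appear at most once per month.
# anyDuplicated() returns 0 when no (group, time) pair is repeated.
n.dups <- anyDuplicated(my.data[, c("group", "time")])
```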
With these variables created, you want to remove the original variables from the imputation. To do that, you can use the idvars argument:
a.out <- amelia(my.data, ts = "time", cs = "group", splinetime = 2, intercs = TRUE,
                idvars = c("District", "Gender", "Month", "Year", "AgeGroup"))
I'm not sure I followed the PS part of your question, but maybe this will get you on the right path. The trick is to use melt() to get the data into long format, then use ddply() to group by:
library(plyr)
library(reshape2)
iw <- read.csv("http://dl.dropbox.com/u/1156404/wightCrimRecords.csv")
iw.m <- melt(iw, id.vars = "sex", measure.vars = "Offence_type")
ddply(iw.m, "sex", function(x) as.data.frame(prop.table(table(x$value))))
Gives us:
      sex                      Var1        Freq
1  Female                  Burglary 0.004950495
2  Female Criminal Damage and Arson 0.017326733
3  Female          Driving Offences 0.371287129
...
50  Other           Supply of drugs 0.000000000
51  Other             Vehicle Crime 0.000000000
52  Other             Violent Crime 0.000000000
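For comparison, the same within-group proportions can be sketched in base R with table() and prop.table(), without plyr or reshape2. This is an alternative to the melt()/ddply() approach, not the method used above, and the toy data frame is made up (only the column names sex and Offence_type come from the example).

```r
# Made-up records; only the column names match the example data.
iw <- data.frame(
  sex          = c("Female", "Female", "Female", "Male", "Male"),
  Offence_type = c("Burglary", "Driving Offences", "Driving Offences",
                   "Burglary", "Violent Crime")
)

# Cross-tabulate, then divide each row by its total (margin = 1),
# so the proportions sum to 1 within each sex.
props <- prop.table(table(iw$sex, iw$Offence_type), margin = 1)

# Long format, comparable in shape to the ddply() output.
props.long <- as.data.frame(props, responseName = "Freq")
names(props.long)[1:2] <- c("sex", "Var1")
print(props.long)
```

The row order differs from the ddply() result, but each sex-offence pair carries the same kind of proportion.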
EDIT - after reading the PS again, I think this is what you had in mind:
iw.m <- melt(iw, id.vars = c("sex", "AGE"), measure.vars = "Offence_type")
ddply(iw.m, c("sex", "AGE"), function(x) as.data.frame(prop.table(table(x$value))))
     sex   AGE                      Var1        Freq
1 Female 18-24                  Burglary 0.011764706
2 Female 18-24 Criminal Damage and Arson 0.047058824
3 Female 18-24          Driving Offences 0.188235294
....
You can keep adding ID variables, which are then passed to ddply() as grouping variables, to reach whatever level of detail you need.
Best Answer
If I understand the question correctly, this will get you what you want. Assuming your data frame is called df and you have N defined, you can do this with sample(). The result is a list of data frames, where each data frame consists of randomly selected rows from df. By default, sample() will assign equal probability to each group.
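One way to do what this describes, offered as a sketch rather than the answer's exact code: draw a random group label for every row and split on it. The toy df, the value of N, and the use of split() are all assumptions on my part.

```r
set.seed(1)  # only to make the toy example reproducible

df <- data.frame(id = 1:10, x = rnorm(10))  # made-up data
N  <- 3                                     # number of groups (assumed)

# Draw a group label in 1..N for every row; sample() gives each label
# equal probability unless the prob argument is supplied.
labels <- sample(1:N, nrow(df), replace = TRUE)

# split() returns a list of data frames, one per label that was drawn.
parts <- split(df, labels)

# Number of rows that landed in each group.
print(sapply(parts, nrow))
```

Because the labels are drawn at random, group sizes will vary, and with few rows a label may not be drawn at all, in which case the list has fewer than N elements.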