I'm trying to do some percentage based comparisons across different groups in a criminal sentencing data set (http://dl.dropbox.com/u/1156404/wightCrimRecords.csv)
I have a data in an array of the form:
$Female
x
Burglary Criminal Damage and Arson
0.004950495 0.017326733
Driving Offences Murder
0.371287129 0.000000000
$Male
x
Burglary Criminal Damage and Arson
0.013001083 0.058504875
Driving Offences Murder
0.303358613 0.000000000
$`Not Stated`
x
Burglary Criminal Damage and Arson
0.0000000 0.0000000
Driving Offences Murder
0.1111111 0.0000000
This was derived from the original data as follows:
iw=read.csv("~/data/recordlevel.csv")
iwp=tapply(iw$Offence_type,iw$AGE,function(x){prop.table(table(x))})
What I would like to do is generate a single data frame that contains a gender column, a frequency column, and rows corresponding to Burglary, Murder etc.
I can extract a single datatable from the multidimensional array, eg using:
iwpF =data.frame(iwp['Female'])
which generates a separate row for each offence and columns referring to offence type and frequency, but can't see how to generate a single datatable.
PS I was also wondering whether it's possible to pull out even more structured data, that for example counts the percentages of offence type sex and age group, so for example I could lookup up what percentage of convictions for males in the 35+ age range are related to murder.
Best Answer
I'm not sure I followed the PS part of your question, but maybe this will get you on the right path. The trick is to use
melt()
to get the data into long format, then useddply()
to group by:Gives us:
EDIT - after reading the PS again, I think this is what you had in mind:
You can obviously continue to add ID variables which then get passed into plyr to group on to any level of detail that is sufficient for your purposes.