Solved – Casting multidimensional data in R into a data frame

I'm trying to do some percentage-based comparisons across different groups in a criminal sentencing data set (http://dl.dropbox.com/u/1156404/wightCrimRecords.csv).

I have the data in an array of the form:

    $Female
    x
                             Burglary                 Criminal Damage and Arson 
                          0.004950495                               0.017326733 
                     Driving Offences                                    Murder 
                          0.371287129                               0.000000000
    $Male
    x
                             Burglary                 Criminal Damage and Arson 
                          0.013001083                               0.058504875 
                     Driving Offences                                    Murder 
                          0.303358613                               0.000000000

    $`Not Stated`
    x
                             Burglary                 Criminal Damage and Arson 
                            0.0000000                                 0.0000000 
                     Driving Offences                                    Murder 
                            0.1111111                                 0.0000000

This was derived from the original data as follows:

iw=read.csv("~/data/recordlevel.csv")
iwp=tapply(iw$Offence_type,iw$sex,function(x){prop.table(table(x))})

What I would like to do is generate a single data frame that contains a gender column, a frequency column, and rows corresponding to Burglary, Murder etc.

I can extract a single data frame from the array, e.g. using:

iwpF =data.frame(iwp['Female'])

which generates a separate row for each offence, with columns for offence type and frequency, but I can't see how to generate a single data frame that covers all the groups.

PS I was also wondering whether it's possible to pull out even more structured data that, for example, gives the percentages by offence type, sex and age group, so that I could look up what percentage of convictions for males in the 35+ age range are related to murder.

Best Answer

I'm not sure I followed the PS part of your question, but maybe this will get you on the right path. The trick is to use melt() to get the data into long format, then use ddply() to group by:

library(plyr)
library(reshape2)
iw <- read.csv("http://dl.dropbox.com/u/1156404/wightCrimRecords.csv")
iw.m <- melt(iw, id.vars = "sex", measure.vars = "Offence_type")
ddply(iw.m, "sex", function(x) as.data.frame(prop.table(table(x$value))))

Gives us:

          sex                                      Var1        Freq
1      Female                                  Burglary 0.004950495
2      Female                 Criminal Damage and Arson 0.017326733
3      Female                          Driving Offences 0.371287129
...
50      Other                           Supply of drugs 0.000000000
51      Other                             Vehicle Crime 0.000000000
52      Other                             Violent Crime 0.000000000

EDIT - after reading the PS again, I think this is what you had in mind:

iw.m <- melt(iw, id.vars = c("sex", "AGE"), measure.vars = "Offence_type")
ddply(iw.m, c("sex", "AGE"), function(x) as.data.frame(prop.table(table(x$value))))

           sex   AGE                                      Var1        Freq
1       Female 18-24                                  Burglary 0.011764706
2       Female 18-24                 Criminal Damage and Arson 0.047058824
3       Female 18-24                          Driving Offences 0.188235294
....

You can keep adding ID variables in the same way; they are passed to ddply() as grouping variables, so you can group down to whatever level of detail you need.
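For comparison, here is a rough base R sketch of the same grouping idea, using table() and prop.table() with a margin instead of melt() and ddply(). The small data frame is made up for illustration and only borrows the column names sex, AGE and Offence_type from the code above; the names by_sex, tab3 and by_sex_age are also just placeholders.

```r
# Toy records standing in for the real file (made-up values, real column names).
iw <- data.frame(
  sex = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  AGE = c("18-24", "18-24", "35+", "18-24", "35+", "35+", "35+"),
  Offence_type = c("Burglary", "Driving Offences", "Driving Offences",
                   "Burglary", "Burglary", "Murder", "Driving Offences")
)

# Proportion of each offence within each sex (each sex sums to 1).
by_sex <- as.data.frame(prop.table(table(iw$sex, iw$Offence_type), margin = 1))
names(by_sex) <- c("sex", "Offence_type", "Freq")

# Two grouping variables: proportions within each sex/AGE combination.
# Note: a sex/AGE combination with no records at all would give NaN here.
tab3 <- table(iw$sex, iw$AGE, iw$Offence_type)
by_sex_age <- as.data.frame(prop.table(tab3, margin = c(1, 2)))
names(by_sex_age) <- c("sex", "AGE", "Offence_type", "Freq")

# Look up a single cell, e.g. murder among males aged 35+.
murder_35 <- subset(by_sex_age,
                    sex == "Male" & AGE == "35+" & Offence_type == "Murder")
print(murder_35)
```

The margin argument plays the role of the grouping variables: proportions are computed within each combination of the listed dimensions.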

Related Solutions

Solved – How to expand data frame in R

While it is a very useful package, I think reshape is overkill in this case; rep() can do the job.

Here are some example data:

df <- data.frame(
     name=c("Person 1", "Person 2", "Person 3", "Person 1", "Person 2", "Person 3"),
     group=c("A", "A", "A", "B", "B", "B"),
     count=c(3,1,0,5,0,1))

Now, to “expand” it:

expanded <- data.frame(name = rep(df$name, df$count),
                       group = rep(df$group, df$count))

Off the top of my head, I could not find a way to work directly on the data frame, so I expand each variable separately and then reassemble them. This is a bit ugly, but it should be fine as long as you always use the same variable for the counts.
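One possible way to work on the data frame directly is to repeat the row indices rather than each column. This is only a sketch using the same example data; idx is a name introduced here for illustration.

```r
df <- data.frame(
  name  = c("Person 1", "Person 2", "Person 3", "Person 1", "Person 2", "Person 3"),
  group = c("A", "A", "A", "B", "B", "B"),
  count = c(3, 1, 0, 5, 0, 1)
)

# Repeat each row number as many times as its count (rows with count 0 drop out).
idx <- rep(seq_len(nrow(df)), times = df$count)

# Index the data frame by the repeated row numbers, keeping all other columns.
expanded <- df[idx, c("name", "group")]
rownames(expanded) <- NULL  # discard the duplicated-row names such as "1.1"
print(expanded)
```

Because every column is selected with the same index vector, there is no risk of expanding different columns with different counts.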

Solved – Multidimensional Scaling “eurodist”

(You need library(MASS) in your code it seems.) From ?eurodist:

The data give the road distances (in km) between 21 cities in Europe. The data are taken from a table in The Cambridge Encyclopaedia.

This is in addition to problem (3) mentioned by ttnphns in the comments. Not only are they not flat distances, but they are not distances as the crow flies either. As one example, the outlier at (1662, 713) on the Shepard plot corresponds to the pair (Cologne, Geneva). (It is slightly difficult to find this because the author of Shepard doesn't seem to have bothered to document it.) Looking at the map of Europe, I think this journey has to be made by quite a wiggly route. You can see the outlier by plotting the distances for Cologne only:

plot(as.matrix(eurodist)[6,], as.matrix(dist(obj))[6,])
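The object obj is not defined in this excerpt. As a self-contained sketch, assuming obj is a matrix of fitted two-dimensional coordinates, the version below substitutes classical MDS from cmdscale() for whatever method produced it in the original question, and selects Cologne by name rather than by row number.

```r
# Assumption: `obj` holds 2-D MDS coordinates. cmdscale() (classical MDS)
# is a stand-in here, not necessarily the method used in the original question.
obj <- cmdscale(eurodist, k = 2)

d_road <- as.matrix(eurodist)   # observed road distances (km)
d_fit  <- as.matrix(dist(obj))  # distances in the fitted configuration

# Compare observed and fitted distances from Cologne to every other city.
city   <- "Cologne"
others <- setdiff(rownames(d_road), city)
plot(d_road[city, others], d_fit[city, others],
     xlab = "Road distance (km)", ylab = "Fitted distance")
text(d_road[city, others], d_fit[city, others],
     labels = others, pos = 3, cex = 0.7)
```

Labelling the points makes it easier to see which pair is responsible for an outlier without having to look up row numbers.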
