Solved – Generate separate plots for each group of records in dataset

data visualizationexcel

I have a large csv file with around 45,000 rows and about 6 columns.

The columns are set-up like:

Module, File-path, Status, Version-Old, Version-New, Change-diff

Change-diff is simply the result of doing: New – Old.

There are around 30 different modules and I want to perform analysis and create different graphs for each module based on the change-diff.

I want to know the number of "rows" for each 'module' – I guess I can just count the number of each module name for the Module column to work that out.

I'm wondering if there is any program that I can use to make my job easier, I'm currently trying to plug it all into Excel but I'm not sure how I can create graphs for each module instead of one big graph for all modules.

Would R, SPSS or Matlab be a better option? Or anything else?

EDIT:
Just expanding on what the data looks like.

Module,  Path,      Status,  Old,  New,  Change
modA,    modA/a/a,  1,       11,   19,   8
modA,    modA/a/b,  1,       2,    4,    2
modA,    modA/a/c,  1,       0,    1,    1
modA,    modA/b/c,  0,       0,    1,    1
modA,    modA/b/a,  1,       7,    12,   5
modB,    modB/a/a,  1,       8,    9,    1
modB,    modB/a/b,  0,       0,    1,    1
modB,    modB/a/c,  1,       4,    10,   6
modC,    modC/a/a,  0,       0,    4,    4
modC,    modC/a/b,  1,       0,    3,    3

So I want to analyse all of modA based on Change (and possibly based on Old vs New etc.)
Then analyse all of modB, modC etc. all in the same way but in separate graphs, automatically because of the amount of data I'm dealing with.

Thanks in advance

Best Answer

Your question is a little ambiguous ("different graphs for each module" doesn't mean quite the same thing as "separate graphs for each module"), but my impression is that you're trying to create the same plot or set of plots using only the data from each module, for all of your modules. This is how I'd do that in R:

# Strings ain't factors
options(stringsAsFactors = FALSE)


# Set up a folder for plot output
plotdir <- file.path(getwd(), "plots")
dir.create(plotdir)

# Load the data
# You'll use a read.csv() statement
modrows <- read.csv("blah.csv")

# This is the structure of your data as output by `dput()` - this is nice
# to include with a question because it can easily be pasted into the console
# to recreate any object
modrows <- structure(list(Module = c("modA", "modA", "modA", "modA", "modA",
                                     "modB", "modB", "modB", "modC", "modC"), 
                          Path = c("modA/a/a", "modA/a/b", "modA/a/c", 
                                   "modA/b/c", "modA/b/a", "modB/a/a", 
                                   "modB/a/b", "modB/a/c", "modC/a/a", 
                                   "modC/a/b"), 
                          Status = c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L), 
                          Old = c(11L, 2L, 0L, 0L, 7L, 8L, 0L, 4L, 0L, 0L), 
                          New = c(19L, 4L, 1L, 1L, 12L, 9L, 1L, 10L, 4L, 3L), 
                          Change = c(8L, 2L, 1L, 1L, 5L, 1L, 1L, 6L, 4L, 3L)), 
                     .Names = c("Module", "Path", "Status", 
                                "Old", "New", "Change"), 
                     class = "data.frame", 
                     row.names = c(NA, -10L))



# There's nothing below that can't be done in base R, but I'm partial to plyr
# and ggplot2 for this kind of work.  In plyr, the first two letters of most
# functions indicate the input and output data structures - ddply takes and
# outputs a data.frame, dlplyr takes a DF and puts out a list, etc.
# d_ply splits up a data.frame but produces nothing.  That's fine because
# I'm just using it as a glorified for loop here.
library(plyr)
library(ggplot2)

# Get the extreme values of each variable so that you can set reasonable
# axis limits to keep your plots comparable
minvals <- apply(modrows[ , -(1:3)], 2, min)
maxvals <- apply(modrows[ , -(1:3)], 2, max)

# Take modrows, split it up by Module, and perform a function on each chunk
# (which is aliased as "x" for the function)
d_ply(modrows, .var = "Module", .progress = "text", .fun = function(x){

    # A really simple histogram of Change for each module
    ggplot(x, aes(x = Change)) +
        geom_histogram(binwidth = 1) +
        xlim(minvals["Change"], maxvals["Change"]) +
        opts(title = paste("Change in ", x$Module[1], sep = ""))

    # Save to the plot directory
    ggsave(file = file.path(plotdir, 
                            paste("Change in ", x$Module[1], ".png", sep = ""))
    )

    # A slightly more complicated plot - subhistograms of Change by Status
    ggplot(x, aes(x = Change)) +
        geom_histogram(binwidth = 1) +
        facet_wrap( ~ Status) +
        xlim(minvals["Change"], maxvals["Change"]) +
        opts(title = paste("Change by status in ", x$Module[1], sep = ""))
    ggsave(file = file.path(plotdir, paste("Change by status in ", 
                                           x$Module[1], ".png", sep = ""))
    )

    # Comparing Old and New
    ggplot(x, aes(x = Old, y = New)) +
        geom_point() +
        xlim(minvals["Old"], maxvals["Old"]) +
        ylim(minvals["New"], maxvals["New"]) +
        opts(title = paste("Old vs. new in ", x$Module[1], sep = ""))

    ggsave(file = file.path(plotdir, paste("Old vs. new in", 
                                           x$Module[1], ".png", sep = ""))
    )

    }
)
Related Question