Solved – Generate separate plots for each group of records in dataset

data visualizationexcel

I have a large csv file with around 45,000 rows and about 6 columns.

The columns are set-up like:

Module, File-path, Status, Version-Old, Version-New, Change-diff

Change-diff is simply the result of doing: New – Old.

There are around 30 different modules and I want to perform analysis and create different graphs for each module based on the change-diff.

I want to know the number of "rows" for each 'module' – I guess I can just count the number of each module name for the Module column to work that out.

I'm wondering if there is any program that I can use to make my job easier, I'm currently trying to plug it all into Excel but I'm not sure how I can create graphs for each module instead of one big graph for all modules.

Would R, SPSS or Matlab be a better option? Or anything else?

EDIT:
Just expanding on what the data looks like.

Module,  Path,      Status,  Old,  New,  Change
modA,    modA/a/a,  1,       11,   19,   8
modA,    modA/a/b,  1,       2,    4,    2
modA,    modA/a/c,  1,       0,    1,    1
modA,    modA/b/c,  0,       0,    1,    1
modA,    modA/b/a,  1,       7,    12,   5
modB,    modB/a/a,  1,       8,    9,    1
modB,    modB/a/b,  0,       0,    1,    1
modB,    modB/a/c,  1,       4,    10,   6
modC,    modC/a/a,  0,       0,    4,    4
modC,    modC/a/b,  1,       0,    3,    3

So I want to analyse all of modA based on Change (and possibly based on Old vs New etc.)
Then analyse all of modB, modC etc. all in the same way but in separate graphs, automatically because of the amount of data I'm dealing with.

Thanks in advance

Best Answer

Your question is a little ambiguous ("different graphs for each module" doesn't mean quite the same thing as "separate graphs for each module"), but my impression is that you're trying to create the same plot or set of plots using only the data from each module, for all of your modules. This is how I'd do that in R:

# Strings ain't factors
options(stringsAsFactors = FALSE)


# Set up a folder for plot output
plotdir <- file.path(getwd(), "plots")
dir.create(plotdir)

# Load the data
# You'll use a read.csv() statement
modrows <- read.csv("blah.csv")

# This is the structure of your data as output by `dput()` - this is nice
# to include with a question because it can easily be pasted into the console
# to recreate any object
modrows <- structure(list(Module = c("modA", "modA", "modA", "modA", "modA",
                                     "modB", "modB", "modB", "modC", "modC"), 
                          Path = c("modA/a/a", "modA/a/b", "modA/a/c", 
                                   "modA/b/c", "modA/b/a", "modB/a/a", 
                                   "modB/a/b", "modB/a/c", "modC/a/a", 
                                   "modC/a/b"), 
                          Status = c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L), 
                          Old = c(11L, 2L, 0L, 0L, 7L, 8L, 0L, 4L, 0L, 0L), 
                          New = c(19L, 4L, 1L, 1L, 12L, 9L, 1L, 10L, 4L, 3L), 
                          Change = c(8L, 2L, 1L, 1L, 5L, 1L, 1L, 6L, 4L, 3L)), 
                     .Names = c("Module", "Path", "Status", 
                                "Old", "New", "Change"), 
                     class = "data.frame", 
                     row.names = c(NA, -10L))



# There's nothing below that can't be done in base R, but I'm partial to plyr
# and ggplot2 for this kind of work.  In plyr, the first two letters of most
# functions indicate the input and output data structures - ddply takes and
# outputs a data.frame, dlplyr takes a DF and puts out a list, etc.
# d_ply splits up a data.frame but produces nothing.  That's fine because
# I'm just using it as a glorified for loop here.
library(plyr)
library(ggplot2)

# Get the extreme values of each variable so that you can set reasonable
# axis limits to keep your plots comparable
minvals <- apply(modrows[ , -(1:3)], 2, min)
maxvals <- apply(modrows[ , -(1:3)], 2, max)

# Take modrows, split it up by Module, and perform a function on each chunk
# (which is aliased as "x" for the function)
d_ply(modrows, .var = "Module", .progress = "text", .fun = function(x){

    # A really simple histogram of Change for each module
    ggplot(x, aes(x = Change)) +
        geom_histogram(binwidth = 1) +
        xlim(minvals["Change"], maxvals["Change"]) +
        opts(title = paste("Change in ", x$Module[1], sep = ""))

    # Save to the plot directory
    ggsave(file = file.path(plotdir, 
                            paste("Change in ", x$Module[1], ".png", sep = ""))
    )

    # A slightly more complicated plot - subhistograms of Change by Status
    ggplot(x, aes(x = Change)) +
        geom_histogram(binwidth = 1) +
        facet_wrap( ~ Status) +
        xlim(minvals["Change"], maxvals["Change"]) +
        opts(title = paste("Change by status in ", x$Module[1], sep = ""))
    ggsave(file = file.path(plotdir, paste("Change by status in ", 
                                           x$Module[1], ".png", sep = ""))
    )

    # Comparing Old and New
    ggplot(x, aes(x = Old, y = New)) +
        geom_point() +
        xlim(minvals["Old"], maxvals["Old"]) +
        ylim(minvals["New"], maxvals["New"]) +
        opts(title = paste("Old vs. new in ", x$Module[1], sep = ""))

    ggsave(file = file.path(plotdir, paste("Old vs. new in", 
                                           x$Module[1], ".png", sep = ""))
    )

    }
)

Related Solutions

Solved – Combine multiple boxplots in a lattice

Sam,

I think I understood what you are after, so let me know if I've misinterpreted anything:

You want a separate box_plot for the ratio of each pairs of columns. There are 15 ratios we are interested in...(column 6 / column 7, column 8 / column 9, etc.)
This plot should have a separate "window" or facet for each annotation, for which there are 23 different annotations.

Assuming both of those are right, I think this will give you what you are after. First, we will make the 15 new ratio columns with a for-loop and some indexing. After we make these 15 new columns, we will melt the data into long format for easy plotting with ggplot2. Since we are only interested in the columns annotation and the new ratio columns, we'll specify those in the call to melt. Then it is a relatively straight forward call to ggplot to specify the axes and faceting variable.

These plots don't make much sense with 10 rows of data, but I think it will look better with your full dataset.

library(ggplot2)

#EDIT: this removes the call to cbind which should improve performance.
for (i in seq(6, ncol(df), by = 2)) {
    df[,  paste(i, i+1, sep = "_", collapse = "")] <- df[, i ] / df[, i + 1 ]

}

df.m <- melt(df, id.vars = "annotation", measure.vars = 36:ncol(df))
#Note that we use the column name for the id.vars and the column order for
#the measure.vars. In the case of the latter, this is simply to save on 
#typing.

ggplot(data = df.m, aes(x = variable, y = value)) + 
    geom_boxplot() + 
    facet_wrap(~ annotation) +
    coord_flip()

Solved – Analyzing CART type trees in R–options for grouping results, formatting plots

For a multi-class model, use the rpart.plot package to show at each leaf the predicted class and the probabilities for each class. For example:

data(iris)
library(rpart)
a <- rpart(Species~., data=iris, method="class", cp=.001, minsplit=5)
library(rpart.plot)
rpart.plot(a, type=0, extra=4, under=TRUE, branch=.3)

which gives:

plot

This doesn't show bar charts, but the displayed the class probabilities are just as informative, and more compact. I believe this essentially is the information on the plot you would like displayed.

Extra information can be displayed by using the type and extra arguments of the rpart.plot function. See the bottom of Figure 1 in the vignette for rpart.plot package for another example with a multi-class response.

Best Answer

Related Solutions

Solved – Combine multiple boxplots in a lattice

Solved – Analyzing CART type trees in R–options for grouping results, formatting plots

Related Question