Solved – Visualizing large dataset with multiple subgroups

Tags: data visualization, group-differences, r, regression, seasonality

I have a large data frame in the following form (I apologize for this formatting):

Site  Season  T      SC       pH    Chl  DO.S   DO    BGA  Tur   fDOM  Flow  Rainfall  Solar   Rain
300N  Winter  14.05  1692.77  7.93  NA   82.26  8.42  NA   9.25  NA    NA    0.00      219.18  no

If the formatting is hard to read: there are 12 numeric variables and 3 categorical factors (Site, Season, and Rain [yes/no]). Each row holds daily averages that I calculated from 15-minute time series. I have spent a good amount of time on data exploration (linear regression analysis, looking at time series plots for patterns), but haven't found a method that works for me yet. I have also worked with corrplot, correlation matrices, and covariance functions the arduous way, subsetting each combination of categories and producing a corrplot for each. (I also tried this with ddply, but the result does not come back in a correlation-matrix format that is easy to plot.) I have also attempted PCA on the data, to little avail.

First and foremost: does anyone have an idea for visualizing this kind of dataset? The main question I am after is, "What are the factors that influence DO (dissolved oxygen)?", and how that changes by location (Site), by Season, and with the influence of Rain. I would really like a quick method for producing correlation matrices (or heat maps; I have tried both) for each categorical subset. I tried this with ggplot and facet_wrap, but couldn't get it to work. I also tried ggpairs from the GGally package, but honestly didn't spend much time with that method.

I was starting to look into star graphs (on polar coordinates), which can be used to visualize periodicity in time series, but I am running out of time and decided to seek the advice of Stack Overflow. I would really appreciate any advice or thoughts on visualizing this data. I feel like some combination of ddply and graphing is what I need, but I haven't gotten there yet.
Thank you for your time.

EDIT:
dput of the data frame in question:

structure(list(Site = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("2100S", 
"300N", "3300S", "800S", "Burnham", "Center"), class = "factor"), 
    Season = structure(c(4L, 4L, 4L, 4L, 2L, 2L), .Label = c("Fall", 
    "Spring", "Summer", "Winter"), class = "factor"), T = c(14.05, 
    14.18, 14.5, 14.58, 14.07, 11.91), SC = c(1692.77, 1671.31, 
    1680.71, 1661.79, 1549.56, 1039.63), pH = c(7.93, 7.92, 7.96, 
    7.95, 7.93, 7.79), Chl = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_), DO.S = c(82.26, 78.79, 82.05, 
    80.92, 74.33, 73.96), DO = c(8.42, 8.04, 8.31, 8.18, 7.61, 
    7.97), BGA = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), Tur = c(9.25, 9.77, 9.41, 10.6, 40.38, 50.25), 
    fDOM = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), Flow = c(NA, 178.08, 178.53, 188.13, 306.15, 382.22
    ), Rainfall = c(0, 0, 0, 0, 0.01, 0.81), Solar = c(219.18, 
    228.33, 244.3, 247.69, 105.15, 220.73), Rain = structure(c(1L, 
    1L, 1L, 1L, 2L, 2L), .Label = c("no", "yes"), class = "factor")), .Names = c("Site", 
"Season", "T", "SC", "pH", "Chl", "DO.S", "DO", "BGA", "Tur", 
"fDOM", "Flow", "Rainfall", "Solar", "Rain"), row.names = c(NA, 
6L), class = "data.frame")

Best Answer

Seems like kind of a tall order, but here's a whirlwind tour of R.

library(party)
library(rattle)
library(ggplot2)
library(car)

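# The code below assumes the data frame from the question is named DO.
# Stack the 6-row sample four times so that it is large enough to grow a tree
# (only needed for the small sample posted in the question).
DO <- rbind(DO, DO, DO, DO)
# Conditional inference tree with DO as the response and all other columns as
# predictors; surrogate splits let the tree cope with missing values.
DO.ctree <- ctree(DO ~ ., data = DO, 
               controls = ctree_control(maxsurrogate = 3))
plot(DO.ctree)
# I think this answers both your "first and foremost" and your "main" questions.
# In brief: the party package helps identify which variables most influence the
# dependent variable.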

[Figure: ctree output]

# Plot DO against Season; ggplot2 makes quick descriptive plots like this easy.
ggplot(DO, aes(factor(Season), DO)) + geom_point()

[Figure: dot plot from ggplot2]
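
Since you asked how DO changes by Site, Season, and Rain together, the same idea extends to a faceted plot. This is only a sketch: `dat` is simulated stand-in data with made-up values (the 6-row sample covers a single site), and the choice of box plots coloured by Rain is my assumption about what would be useful, not something taken from your data.

```r
library(ggplot2)

set.seed(7)
# Hypothetical stand-in for the full data frame: three sites, four seasons,
# and random DO values. Replace with your own data.
dat <- data.frame(
  Site   = rep(c("300N", "800S", "Burnham"), each = 40),
  Season = rep(c("Winter", "Spring", "Summer", "Fall"), times = 30),
  Rain   = sample(c("no", "yes"), 120, replace = TRUE),
  DO     = rnorm(120, mean = 8, sd = 0.7)
)

# One panel per Site, Season on the x axis, separate boxes for rain / no rain.
p <- ggplot(dat, aes(Season, DO, colour = Rain)) +
  geom_boxplot() +
  facet_wrap(~ Site)
print(p)
```

With the real data, swapping `geom_boxplot()` for `geom_point()` or `geom_violin()` is a one-word change, so it is cheap to try a few views.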

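# Drop columns that are entirely NA (Chl, BGA and fDOM in this sample)
DO <- DO[, !sapply(DO, function(x) all(is.na(x)))]
# Keep only the numeric columns
DO.numeric <- DO[, sapply(DO, is.numeric)]
# Correlation matrix on the complete rows, rounded to one decimal place
round(cor(na.omit(DO.numeric)), 1)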
#           T   SC   pH DO.S   DO  Tur Flow Rainfall Solar
# T         1.0  1.0  1.0  0.7  0.3 -0.8 -0.9     -1.0   0.0
# SC        1.0  1.0  1.0  0.7  0.3 -0.9 -0.9     -1.0   0.1
# pH        1.0  1.0  1.0  0.7  0.3 -0.8 -0.8     -1.0   0.0
# DO.S      0.7  0.7  0.7  1.0  0.9 -0.9 -0.9     -0.6   0.7
# DO        0.3  0.3  0.3  0.9  1.0 -0.7 -0.6     -0.1   0.9
# Tur      -0.8 -0.9 -0.8 -0.9 -0.7  1.0  1.0      0.8  -0.6
# Flow     -0.9 -0.9 -0.8 -0.9 -0.6  1.0  1.0      0.8  -0.5
# Rainfall -1.0 -1.0 -1.0 -0.6 -0.1  0.8  0.8      1.0   0.1
# Solar     0.0  0.1  0.0  0.7  0.9 -0.6 -0.5      0.1   1.0
# Here's a brief correlation summary
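
To get one such matrix per categorical subset without subsetting by hand, `split()` plus `lapply()` in base R should be enough. A minimal sketch, assuming simulated stand-in data: `dat`, `groups`, and `cor_by_group` are names I made up for illustration, and I use only three numeric columns to keep it short.

```r
set.seed(1)
# Hypothetical stand-in for the full data frame; the values are random.
dat <- data.frame(
  Site   = rep(c("300N", "800S"), each = 40),
  Season = rep(rep(c("Winter", "Spring"), each = 20), times = 2),
  T      = rnorm(80, mean = 14, sd = 2),
  SC     = rnorm(80, mean = 1500, sd = 200),
  DO     = rnorm(80, mean = 8, sd = 0.5)
)

num_cols <- names(dat)[sapply(dat, is.numeric)]

# Split the numeric columns by every Site x Season combination that occurs,
# then compute one correlation matrix per group.
groups <- split(dat[num_cols], list(dat$Site, dat$Season), drop = TRUE)
cor_by_group <- lapply(groups, cor, use = "pairwise.complete.obs")

# The list is named "<Site>.<Season>", so a single subset is easy to pull out.
round(cor_by_group[["300N.Winter"]], 2)
```

Each element of `cor_by_group` is an ordinary correlation matrix, so it can be passed straight to `corrplot()` or any other function that expects one.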

# The large chart of pairwise relationships I think you asked for
scatterplotMatrix(na.omit(DO.numeric))

[Figure: scatterplotMatrix output]
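
If you would rather have heat maps side by side than a stack of separate plots, one option is to reshape the per-group correlations into a long data frame and let `facet_wrap()` do the layout. Again a sketch on simulated data: `dat` and `cor_long` are hypothetical names, the values are random, and grouping by Site alone is just to keep the example small.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in data; replace with the real data frame.
dat <- data.frame(
  Site = rep(c("300N", "800S"), each = 30),
  T    = rnorm(60, mean = 14, sd = 2),
  Tur  = rnorm(60, mean = 20, sd = 5),
  DO   = rnorm(60, mean = 8, sd = 0.5)
)
num_cols <- names(dat)[sapply(dat, is.numeric)]

# Build one long data frame: a row per (Site, variable pair) with its r value.
cor_long <- do.call(rbind, lapply(split(dat, dat$Site), function(g) {
  m <- cor(g[num_cols], use = "pairwise.complete.obs")
  data.frame(
    Site = g$Site[1],
    var1 = rep(rownames(m), times = ncol(m)),  # row index varies fastest
    var2 = rep(colnames(m), each = nrow(m)),
    r    = as.vector(m)                        # column-major, matches above
  )
}))

# One heat map panel per Site, on a common colour scale from -1 to 1.
p <- ggplot(cor_long, aes(var1, var2, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) +
  facet_wrap(~ Site)
print(p)
```

To facet by more than one factor, split on `list(dat$Site, dat$Season)` instead, carry both columns into `cor_long`, and switch to `facet_grid(Season ~ Site)`.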

You may be interested in checking out the rattle package/GUI: it can get you off to a quick start with a lot of these general questions.
