Solved – How to aggregate by minute data for a week into hourly means?

aggregationrtime series

How would you get hourly means for multiple data columns, for a daily period, and show results for twelve "Hosts" in the same graph? That is, I'd like to graph what a 24 hour period looks like, for a weeks worth of data. The eventual goal would be to compare two sets of this data, before and after samplings.

                dates     Hos      CPUIOWait CPUUser CPUSys
1 2011-02-11 23:55:12     db       0         14      8
2 2011-02-11 23:55:10     app1     0          6      1
3 2011-02-11 23:55:09     app2     0          4      1

I've been able to run xyplot(CPUUser ~ dates | Host) with good effect. However, rather than showing each date in the week, I'd like the X axis to be the hours of the day.

Trying to get this data into an xts object results in errors such as:

"order.by requires an appropriate time-based object"

Here is a str() of the data frame:

'data.frame':   19720 obs. of  5 variables:
$ dates    : POSIXct, format: "2011-02-11 23:55:12" "2011-02-11 23:55:10" ...
$ Host     : Factor w/ 14 levels "app1","app2",..: 9 7 5 4 3 10 6 8 2 1 ...  
$ CPUIOWait: int  0 0 0 0 0 0 0 0 0 0 ...
$ CPUUser  : int  14 6 4 4 3 10 4 3 4 4 ...
$ CPUSys   : int  8 1 1 1 1 3 1 1 1 1 ...

UPDATE: Just for future reference, I decided to go with a boxplot, to show both the median, and the 'outliers'.

Essentially:

Data$hour <- as.POSIXlt(dates)$hour  # extract hour of the day
boxplot(Data$CPUUser ~ Data$hour)    # for a subset with one host or for all hosts
xyplot(Data$CPUUser ~ Data$hour | Data$Host, panel=panel.bwplot, horizontal=FALSE)

Best Answer

Here is one approach using cut() to create the appropriate hourly factors and ddply() from the plyr library for calculating the means.

library(lattice)
library(plyr)

## Create a record and some random data for every 5 seconds 
## over two days for two hosts.
dates <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
             as.POSIXct("2011-01-02 23:59:55", tz = "GMT"),
             by = 5)
hosts <- c(rep("host1", length(dates)), rep("host2", 
           length(dates)))
x1    <- sample(0:20, 2*length(dates), replace = TRUE)
x2    <- rpois(2*length(dates), 2)
Data  <- data.frame(dates = dates, hosts = hosts, x1 = x1, 
                    x2 = x2)

## Calculate the mean for every hour using cut() to define 
## the factors and ddply() to calculate the means. 
## getmeans() is applied for each unique combination of the
## hosts and hour factors.
getmeans  <- function(Df) c(x1 = mean(Df$x1), 
                            x2 = mean(Df$x2))
Data$hour <- cut(Data$dates, breaks = "hour")
Means <- ddply(Data, .(hosts, hour), getmeans)
Means$hour <- as.POSIXct(Means$hour, tz = "GMT")

## A plot for each host.
xyplot(x1 ~ hour | hosts, data = Means, type = "o",
       scales = list(x = list(relation = "free", rot = 90)))
Related Question