Solved – Average and standard deviation of timestamps (time wraps around at midnight)

circular statistics, mean, standard deviation, time series

I have lots of sensor data with timestamps like "2014-09-09 16:10:45" and accompanying sensor readings. To get some insight into these, I want to find "unusual" events by looking at the average and standard deviation of the time part of the timestamp. How can I handle the wrap-around of time at midnight?

A made-up example: imagine power readings being influenced by people turning on machines in the morning (a power sensor would notice increasing values) and turning them off in the evening (decreasing values). I want to find sensor readings that are unusual: decreasing readings in time periods where readings usually increase, and increasing readings in time periods where readings usually decrease.

My idea was to extract the time part of the timestamps (e.g. 12:55:10), convert it to seconds (the day has 86,400 seconds) and then, split by the tendency of the readings (e.g. only looking at increasing readings), calculate the average and standard deviation. If I then take the time window from "average second of the day minus standard deviation" to "average second of the day plus standard deviation" (maybe using twice the standard deviation), I would have the typical periods, and every increasing reading outside this time window would be unusual.

The problem: time wraps around at midnight! A reading at 00:15:00 would actually be really close to 23:50:00 in reality, but "far away" in the calculation. This surely skews the statistics unless everything happens mid-day. Is there a standard practice to handle this? Can you give me ideas? I am totally stumped at the moment. I would love to stay in PostgreSQL, but as that is not a requirement I did not tag it. Anything helps!
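To make the skew concrete: 23:50:00 is second 85,800 of the day and 00:15:00 is second 900, so on the clock they are only 1,500 seconds (25 minutes) apart. A naive average, (85800 + 900) / 2 = 43,350, lands at 12:02:30, the middle of the day.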

Below is some example data, I have about 200-300 readings per sensor. You can see that in this example the increases happen in the morning.

"Timestamp as %Y-%m-%d %H:%M:%S";"Day of the year";"Second of the day";"Tendency of reading"
"2014-03-01 14:45:00";60;53100;-0.030
"2014-03-03 08:18:00";62;29880;0.150
"2014-03-03 14:17:00";62;51420;-0.120
"2014-03-03 16:37:00";62;59820;-0.030
"2014-03-04 08:11:00";63;29460;0.150
"2014-03-04 10:21:00";63;37260;-0.150
"2014-03-04 16:12:00";63;58320;-0.030
"2014-03-05 08:04:00";64;29040;0.150
"2014-03-05 14:42:00";64;52920;-0.060
"2014-03-05 17:27:00";64;62820;-0.030
"2014-03-06 08:29:00";65;30540;0.090
"2014-03-06 12:06:00";65;43560;-0.030
"2014-03-06 13:49:00";65;49740;-0.120
"2014-03-07 08:21:00";66;30060;0.150
"2014-03-07 10:27:00";66;37620;-0.030
"2014-03-07 11:27:00";66;41220;0.030
"2014-03-07 13:46:00";66;49560;-0.060
"2014-03-07 16:59:00";66;61140;-0.030
"2014-03-07 18:52:00";66;67920;-0.030
"2014-03-08 08:47:00";67;31620;0.120

Best Answer

Let's use the simplification you suggest: only use the data from positive readings and disregard the value of the reading, so we are left with a single set of circular data. You can use the circular dispersion, as whuber suggested, possibly multiplied by some constant, to determine how much of the data should be seen as outliers. A good text that is slightly easier to understand than the Wikipedia page is Statistical Analysis of Circular Data by N.I. Fisher (1995).

I'll give some more straightforward formulas than the Wikipedia page, along with some sample code.

The dispersion can be calculated as follows (following Fisher, pp. 32-34):

  1. Denote the data by $\boldsymbol\theta = \{\theta_1, \dots, \theta_n\}$ and let $S = \sum_{i=1}^{n} \sin \theta_i$ and $C = \sum_{i=1}^{n} \cos \theta_i$. An estimate of the mean direction $\hat\mu$ can be calculated with

    • $\hat\mu = \operatorname{atan2}(S, C)$.

  2. Calculate $\bar{R} = \frac{\sqrt{S^2 + C^2}}{n}$.

  3. Calculate the dispersion as suggested by whuber. I'm not sure why, but Wikipedia's definition seems to differ slightly from Fisher's; I will use Fisher's:

    • $\hat\delta = \frac{1 - \left[ (1/n) \sum_{i=1}^{n} \cos 2 (\theta_i - \hat\mu) \right]}{2\bar{R}^2}.$
  4. Finally, choose some constant $c$ ($1$ is probably fine, but you may fine-tune it). The interval is then given by

    • $ \left[\hat\mu - c \hat\delta, \hat\mu + c \hat\delta \right]$.

I know you want to avoid R, but just to show how this is calculated, here is some basic R code anyway, which also generates a plot:

# Simulate n angles uniformly on the left half-circle
n  <- 200
th <- runif(n, 0.5 * pi, 1.5 * pi)

# Plot the raw angles on the unit circle
plot(cos(th), sin(th), xlim = c(-1, 1), ylim = c(-1, 1))

# Components of the resultant vector
S <- sum(sin(th))
C <- sum(cos(th))

# Estimate of the mean direction (step 1)
mu_hat <- atan2(S, C)

# Mean resultant length (step 2)
R_bar <- sqrt(S^2 + C^2) / n

# Fisher's circular dispersion (step 3)
delta_hat <- (1 - sum(cos(2 * (th - mu_hat))) / n) / (2 * R_bar^2)

# Interval mu_hat +/- constant * delta_hat (step 4)
constant <- 0.8
CI <- mu_hat + c(-1, 1) * constant * delta_hat

# Draw the two cut-off directions
lines(x = c(0, cos(CI[1])), y = c(0, sin(CI[1])), col = "green")
lines(x = c(0, cos(CI[2])), y = c(0, sin(CI[2])), col = "blue")

[Figure: cut-off values for c = 0.8 with a uniform half-circle from a sample of 200.]
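To connect this back to the data in the question, here is a minimal sketch (only a sketch, using the positive-tendency readings from the sample data above): it maps seconds of the day onto the circle via $\theta = 2\pi \cdot \text{sec} / 86400$, reuses the formulas above, and converts the resulting window back to clock time. The helper to_sec is my own hypothetical function, not part of any library.

# Seconds of the day for the positive-tendency (increasing) readings above
sec <- c(29880, 29460, 29040, 30540, 30060, 41220, 31620)

# Map seconds of the day onto the circle: 86400 s = one full turn of 2*pi
th <- 2 * pi * sec / 86400
n  <- length(th)

S <- sum(sin(th))
C <- sum(cos(th))
mu_hat    <- atan2(S, C)         # mean direction
R_bar     <- sqrt(S^2 + C^2) / n # mean resultant length
delta_hat <- (1 - sum(cos(2 * (th - mu_hat))) / n) / (2 * R_bar^2)

constant <- 1
interval <- mu_hat + c(-1, 1) * constant * delta_hat

# Hypothetical helper: map an angle back to a second of the day in [0, 86400)
to_sec <- function(a) (a %% (2 * pi)) / (2 * pi) * 86400

to_sec(mu_hat)    # average second of the day
to_sec(interval)  # typical window, in seconds of the day

Any increasing reading whose second of the day falls outside this window (measured around the circle, not on the number line) would then be flagged as unusual.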

As a final note, it may still be better to use the additional information provided by the value of the reading, and not only its sign, because it may give better estimates. However, the simplification of only using the sign makes the problem much more manageable. If anyone has a good solution that incorporates the readings, I would love to know!
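One possible direction, purely as a hypothetical sketch and not something taken from Fisher or the method above: weight each angle by the magnitude of its reading when forming the resultant, so that large changes pull the mean direction more strongly than small ones.

# Hypothetical weighted variant; weights are the tendencies of the increasing readings above
sec <- c(29880, 29460, 29040, 30540, 30060, 41220, 31620)
w   <- c(0.150, 0.150, 0.150, 0.090, 0.150, 0.030, 0.120)

th <- 2 * pi * sec / 86400

# Weighted resultant components and weighted mean direction
Sw   <- sum(w * sin(th))
Cw   <- sum(w * cos(th))
mu_w <- atan2(Sw, Cw)

(mu_w %% (2 * pi)) / (2 * pi) * 86400  # weighted average second of the day

Whether a matching weighted dispersion gives a sensible outlier window is an open question; I have not seen it worked out for this setting.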