Solved – Time series analysis on login data to forecast CPU demand using R

forecasting, r, time series

Motivation: I was hired as an intern a few weeks ago to figure out if my company needed to buy new machines six months in advance. Database machines take up to 4 months to install and there is a 2 month grace period.

I signed an NDA, so I don't think I can give any actual data.

The only reliable information I have now is the number of logins and registrations for an education company from 2002 to 2011. I think I can get more recent information on registrations, and people are working on getting login information. We stopped logging login information in 2011, so there will be a gap with no data when I try to forecast 🙁

The information is collected daily.

I've created a time series forecast of the data using R. I used this tutorial
http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html#arima-models to make a Holt-Winters exponential smoothing model with daily frequency (frequency = 365). I've removed February 29 from the data. Unfortunately, the gap in login data means I will have to try a more specific ARIMA model, right? Will I be able to use ARIMA if there are long gaps in the data? Also, the arima function in R doesn't allow frequencies greater than 350, and it runs out of memory quickly, so I'd have to use a monthly model (frequency = 12). I have tried using Fourier terms, but the predictions didn't look right intuitively. Since I want to know what the peak usages are, though, I think I need something finer-grained than monthly. Is it OK to use a weekly frequency (frequency = 52) and just remove Dec 31?
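A minimal sketch of the data preparation described above, using simulated counts because the real data is under NDA. The `logins` vector, its Poisson rate, and all variable names are assumptions for illustration, and only base R is used. It builds the `frequency = 365` series with February 29 dropped and, as an alternative, a monthly-peak series with `frequency = 12`.

```r
# Simulated stand-in for the daily login counts (hypothetical data).
set.seed(1)
dates  <- seq(as.Date("2002-01-01"), as.Date("2011-12-31"), by = "day")
logins <- rpois(length(dates), lambda = 1000)

# Option 1: drop Feb 29 so that every year has exactly 365 observations.
keep     <- format(dates, "%m-%d") != "02-29"
daily_ts <- ts(logins[keep], start = c(2002, 1), frequency = 365)

# Option 2: aggregate to the monthly peak, which keeps the quantity of
# interest (peak load) while giving a short seasonal period of 12.
month_id   <- format(dates, "%Y-%m")
monthly_ts <- ts(as.vector(tapply(logins, month_id, max)),
                 start = c(2002, 1), frequency = 12)
```

Aggregating with `max` rather than `sum` is a design choice here: it keeps the monthly series tied to peak usage, which is what the capacity question is about.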

Is a daily frequency allowable? For example, can I use exponential smoothing with daily frequency even though Sept 7, 2012 might fall on a Sunday, whereas Sept 7 in 2011, 2010, and 2009 might all be weekdays? There is daily, weekly, and yearly seasonality in demand and in the number of logins. E.g., 6pm, Monday, and September are generally more loaded than 4am, Saturday, and May. There is also a yearly seasonality in the number of registrations.
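One way to look at the weekday concern is to treat the week, not the year, as the seasonal period, so that the weekday pattern is estimated by position in the week instead of by calendar date. The sketch below uses simulated data; the `day_effect` values and the base level of 1000 are made-up assumptions, and `stl` from base R is just one possible decomposition.

```r
# Simulated daily logins with a weekday effect (hypothetical numbers).
set.seed(42)
n_weeks    <- 104
day_effect <- c(Mon = 300, Tue = 200, Wed = 150, Thu = 100,
                Fri = 0, Sat = -400, Sun = -350)
logins <- 1000 + rep(day_effect, times = n_weeks) +
  rnorm(7 * n_weeks, sd = 50)

# frequency = 7 makes the week the seasonal period, so position 1 is
# always a Monday in this series, whatever the calendar date is.
weekly_ts <- ts(logins, frequency = 7)

# STL decomposition with a fixed (periodic) seasonal pattern.
fit      <- stl(weekly_ts, s.window = "periodic")
seasonal <- fit$time.series[, "seasonal"]

# Average seasonal effect for each position in the week.
weekday_effect <- tapply(as.numeric(seasonal),
                         as.integer(cycle(weekly_ts)), mean)
names(weekday_effect) <- names(day_effect)
print(round(weekday_effect))
```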

I've been having some issues with the login predictions. The problem is that, at the 80% confidence level, variability increases too much before 6 months have even passed. In the plot below, the projection line extends into 2012 and the orange area is the 80% confidence interval. Taking logs and using additive exponential smoothing gave me much more variability than multiplicative exponential smoothing.
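To put a number on how fast the interval widens, something like the following sketch might help. It uses the built-in monthly `AirPassengers` series purely as a stand-in (it is not the login data), and the `rel_width` name is an assumption for illustration.

```r
# AirPassengers (monthly, built into R) is only a stand-in for the login
# series; it is not the data discussed in this question.
fit <- HoltWinters(AirPassengers, seasonal = "multiplicative")

# Six-step-ahead forecast with an 80% prediction interval.
pred <- predict(fit, n.ahead = 6, prediction.interval = TRUE, level = 0.80)

# Width of the interval relative to the point forecast at each horizon.
rel_width <- (pred[, "upr"] - pred[, "lwr"]) / pred[, "fit"]
print(round(cbind(pred, rel_width), 3))
```

Tabulating the relative width per horizon gives a concrete figure for comparing the additive-on-logs fit with the multiplicative one. With daily data, six months is roughly 180 steps ahead rather than 6.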

It's not useful to the company to say that "well you might have 8 jillion logins sometime in the next 6 months and you might have 20% more than you had last year." How do I reduce the variance in the projection?

http://img836.imageshack.us/img836/8460/holtwintersloginmultipl.png

Finally, I was thinking that after I got accurate projections, I'd put logins and registrations into a neural network, use something like average wait time on a few machines as the output variable, and forecast the projected peak demand for processing power in 6 months. There are other variables to consider, like software releases that change CPU demand per user, but I'm hoping the neural network will learn when these happen, or that they are easy to detect and account for. I don't have any good data on average wait time yet, but assuming I find some, is this a good plan?

Best Answer

Exponential smoothing is just a special case of an ARIMA model. If there is a benefit to fitting a general ARIMA model, it is because of its generality, not because it handles gaps in the data any better than exponential smoothing.

I don't see any reason for throwing out February 29th. Individual dates would not have any appreciable effect on seasonality if there is some periodic component to the series. The time unit for time series analysis can be whatever unit you measure the data in (days, weeks, or years), and you can accumulate data to create longer time intervals for the model. The fact that a date falls on a different day of the week in one year than in another has nothing to do with its utility. If there are weekly effects, they can show up in a 7-day periodic component.

Gaps in the data do hurt your ability to fit the model. But if a single ARIMA model would have fit the complete series well, you can probably identify it by piecing together the available portions of the series, keeping count through the time index of the number of days missing at each gap.

I don't understand why you can't have days as the time units. Is there a problem with having a long series? It seems to me that the time unit only affects the number of time points in the series.
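One way to keep count of the missing days through the time index, sketched in base R, is to leave the gap in the series as `NA` instead of deleting those days. The AR(1) model, the series length, and the position of the gap below are assumptions chosen only for illustration.

```r
# Simulated AR(1) series standing in for the daily data (hypothetical).
set.seed(123)
n <- 500
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))

# Keep the full time index and mark the unlogged stretch as missing,
# instead of deleting those days and closing the gap.
x_gap <- x
x_gap[201:260] <- NA

# arima() handles NAs through its state-space (Kalman filter) likelihood.
fit <- arima(x_gap, order = c(1, 0, 0), method = "ML")
print(coef(fit))
```

Because the observations on either side of the gap keep their original positions, the fitted model treats them as 60 days apart instead of adjacent.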
