Hourly predictions using time series

forecasting, multiple-seasonalities, time series

I'd like to build a model based on time series. I have a dataset with records every 30 minutes for three months.

What is the difference between modeling these data with the following kinds of models?

  • Extracting hour/week-day/month and using them as features in machine learning algorithms
  • Using ARMA models

My data contains weather information. One of the scenarios I am working on is predicting "the use of bikes", which is related to information like weather, temperature, wind and time (day and hour; I don't think month makes sense). In such scenarios, should I use a time series model such as ARMA, or just extract hour/week-day/month and use them as features for algorithms like decision trees or random forests?

Can anyone explain the difference, or point me to a paper or book to check?

Note: I am a self-learner and haven't attended any data science class. Apologies if this is obvious.

Best Answer

Well, the difference is... that they are different methods. ("Can anyone explain the difference between apples and oranges?")

  • ARIMA models are explained in any introductory time series book. (I'll never tire of recommending this free open source online forecasting textbook.) If you want to include weather info, you'd need ARIMA models with eXplanatory or eXternal information, or ARIMAX models. These are also standard.

  • Trees/CARTs/Random Forests are explained in any Data Science textbook, or even the Wikipedia pages. These will, of course, model explanatory variables "as-is". Your idea of using days, hours and months as features does make sense in this context. However, simply feeding independent dummies for "9-10am", "10-11am" and so forth into your model may or may not account for the fact that your observations in the 9-10am and the 10-11am time buckets will be more highly correlated than the ones in the 9-10am and the 1-2pm buckets.
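The point about independent hour dummies can be made concrete with a small sketch. The cyclic (sine/cosine) encoding below is one common workaround, not something this answer itself proposes, and the helper names `calendar_features`, `one_hot_hour`, `cyclic_hour` and `euclidean` are made up for illustration. With dummies, 9am is exactly as far from 10am as from 1pm; on the circle, neighbouring hours stay close.

```python
import math
from datetime import datetime


def calendar_features(ts):
    """Extract hour, week-day (Monday = 0) and month from one timestamp."""
    return {"hour": ts.hour, "weekday": ts.weekday(), "month": ts.month}


def one_hot_hour(hour):
    """24 independent dummies: one column per hour of the day."""
    return [1.0 if h == hour else 0.0 for h in range(24)]


def cyclic_hour(hour):
    """Place the hour on a circle so that 23:00 and 00:00 end up adjacent."""
    angle = 2 * math.pi * hour / 24
    return [math.sin(angle), math.cos(angle)]


def euclidean(a, b):
    """Plain Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# One half-hourly record, as in the question's data.
print(calendar_features(datetime(2020, 6, 1, 9, 30)))

# Dummies: every pair of distinct hours is equally far apart.
print(euclidean(one_hot_hour(9), one_hot_hour(10)),
      euclidean(one_hot_hour(9), one_hot_hour(13)))

# Cyclic encoding: 9am is closer to 10am than to 1pm.
print(euclidean(cyclic_hour(9), cyclic_hour(10)),
      euclidean(cyclic_hour(9), cyclic_hour(13)))
```

Tree-based models can also take the raw integer hour and split it into ranges, so whether a cyclic encoding helps in your case is an empirical question.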

A couple of random thoughts:

  • ARIMA(X) will have a hard time dealing with the multiple seasonalities involved (year-over-year, intra-week with people commuting to work Mon-Fri but not Sat/Sun, intra-day with more people biking during the day). You could in principle model these seasonalities using dummies in your ML models. Alternatively, there are a couple of approaches to multiple seasonalities in the context of Exponential Smoothing/State Space models.

  • Weather is of course highly correlated with time-of-year and time-of-day: it's hotter in summer and during the day than in winter and during the night. If you already model seasonality as above, you may find that adding weather information does not improve the forecasts very much beyond what seasonality already does.

  • If you want to forecast something using the weather, remember that you will need weather forecasts, too! Don't assess your out-of-sample forecasts based on how they perform with actual weather: you won't know tomorrow's actual weather when you do "production" forecasting. The uncertainty in weather forecasts adds another source of uncertainty to your bicycling forecasts. In particular, weather forecasts are not very reliable for more than 15 days out, so they won't be very helpful for forecasting bike rides that far out. (Incidentally, getting historical weather data is far easier and cheaper than getting historical weather forecasts.)

  • You may want to look at the electricity price or load forecasting literature - that use case deals with many of your challenges (high frequency data, multiple seasonalities, weather influence). I haven't read this review yet, but it may be helpful.
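The warning about evaluating with actual weather can be illustrated with a toy backtest. Everything here is an assumption for illustration: the numbers are synthetic, the one-variable least-squares line stands in for whatever model you actually use, and the names `fit_line`, `predict` and `mae` are hypothetical. The sketch only shows that scoring the same model with the temperature forecast instead of the observed temperature gives a larger, more honest error.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def predict(params, xs):
    """Apply the fitted line to new temperatures."""
    intercept, slope = params
    return [intercept + slope * x for x in xs]


def mae(actual, predicted):
    """Mean absolute error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Synthetic training history: observed temperature (deg C) and ride counts.
train_temp = [8.0, 12.0, 15.0, 18.0, 22.0, 25.0]
train_rides = [61.0, 79.0, 97.0, 108.0, 131.0, 144.0]

# Synthetic holdout period, later in time than the training data.
test_temp_actual = [10.0, 16.0, 20.0, 24.0]    # observed afterwards
test_temp_forecast = [13.0, 14.0, 23.0, 21.0]  # forecast issued beforehand
test_rides = [71.0, 99.0, 120.0, 141.0]

params = fit_line(train_temp, train_rides)

# Optimistic: uses weather you would not have had at forecasting time.
mae_with_actuals = mae(test_rides, predict(params, test_temp_actual))
# Realistic: uses only the weather forecast available in "production".
mae_with_forecasts = mae(test_rides, predict(params, test_temp_forecast))

print("MAE with actual weather:    ", round(mae_with_actuals, 2))
print("MAE with forecasted weather:", round(mae_with_forecasts, 2))
```

The gap between the two errors is the part of the uncertainty that comes from the weather forecast itself, which is why historical weather forecasts, not just historical weather, are needed for a fair evaluation.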