Solved – Assigning Weights to An Averaged Forecast

forecast-combination, forecasting, r, time-series

So I've been learning how to forecast this summer using Rob Hyndman's book Forecasting: Principles and Practice. I've been working in R, but my questions aren't about code. For the data I've been using, I've found that an average forecast of multiple models produces higher accuracy than any single model by itself.

Recently I read a blog post about averaging forecasting methods and assigning weights to them. So in my case, let's say I fit 11 different models to my data (ARIMA, ETS, Holt-Winters, naive, snaive, and so forth) and I want to average a few of these to get a forecast. Has anyone had experience with this, or can anyone point me to an article that gives some insight on the best way to go about it?

As of right now, I'm using cross-validation and mean absolute error to figure out which models perform best and which perform worst. I can even use this to identify the top k models.

I guess my questions are:

1) How many models would you suggest selecting? (2,3,4,5,6, etc)

2) Any ideas on weights? (50% to the best, 25% to the second best, 15% to the third best, 10% to the fourth best, etc.)

3) Are any of these forecasting models redundant and shouldn't be included?
(Arima, snaive, naive, HW's "additive", ETS, HoltWinters exponential smoothing, HoltWinters smoothing w/ trend, HoltWinters w/ trend/seasonality, multiple regression)

Best Answer

The answers to your questions, in order:

How Many Models

Usually as many as you want, though this can be limited by the amount of data you have. It also depends on the method you use to derive the weights (which I explain more below).

How to Assign Weights

There are many; here are the five most popular off the top of my head, though none of them use mean absolute error.

  1. Equal Weights for all models
    • pros:
      1. Simple, easy to implement
      2. Often outperforms more complex techniques
      3. You can, in theory, add as many models as you want
    • cons:
      1. May be too oversimplified
      2. No inherent method for ranking models
    • References
      1. Aiolfi, M. and A. Timmermann (2006), “Persistence in Forecasting Performance and Conditional Combination Strategies”, Journal of Econometrics, 135 (1-2), 31-53.
      2. Manescu, Cristiana, and Ine Van Robays. "Forecasting the Brent oil price: addressing time-variation in forecast performance." (2014).
  2. Inverse Mean Square Forecast Error (MSFE) ratio: For $M$ models the combined, $h$-step-ahead forecast is $$ \hat y_{t+h}=\sum_{m=1}^{M} w_{m,h,t}\hat y_{t+h,m},\;\;\;w_{m,h,t}=\frac{(1/msfe_{m,h,t})^k}{\sum_{j=1}^M (1/msfe_{j,h,t})^k} $$ where $\hat y_{t+h,m}$ is the point forecast for $h$ steps ahead at time $t$ from model $m$. In most applications $k=1$.
    • pros:
      1. Firm theoretical backing
      2. It's been around for a while and is well accepted in the literature
      3. You can, in theory, add as many models as you want
    • cons:
      1. Based solely on point forecasts; it does not consider the entire forecast distribution. (Most applied models give us an entire parametric distribution for the forecast; the normal distribution is common, $y_{t+h,m} \sim N(\hat y_{t+h,m},\sigma_{t+h,m})$. Many argue that not utilizing this additional parametric information, by considering only $\hat y_{t+h,m}$, results in sub-optimal forecasts.)
    • References
      1. Bates, John M., and Clive W. J. Granger. "The combination of forecasts." Operational Research Quarterly 20.4 (1969): 451-468.
      2. Marcellino, Massimiliano. "Forecast pooling for short time series of macroeconomic variables." Working Papers 212, IGIER (Innocenzo Gasparini Institute for Economic Research), Bocconi University (2002).
  3. Bayesian Forecast Combination: For point estimate forecast combination the formula is $$ \hat y_{t+h}=\sum_{m=1}^M w(m|y_1,...,y_t) \hat y_{t+h,m} $$ and the combined forecast distribution is $$ f(y_{t+h}|y_1,...,y_t)=\sum_{m=1}^M w(m|y_1,...,y_t) f_m(y_{t+h}|y_1,...,y_t)$$ where $f_m$ is the $m$th model's forecast distribution (a pdf). The weights $w(m|y_1,...,y_t)$ are such that $\sum_{m=1}^{M} w(m|y_1,...,y_t)=1$ and $w(m|y_1,...,y_t)>0$ for all $m$. The weights can be calculated either as the traditional posterior probability of each model $m$ via Bayesian Model Averaging (an in-sample technique similar to BIC, but scaled) or by scaling the predictive likelihood (out-of-sample predictive density) of each model. I forgo showing exactly how to calculate the weights for brevity; if you are curious, see the references.
    • pros:
      1. Considers the entire forecast distribution when calculating weights, not just the point forecast
      2. You can, in theory, add as many models as you want
    • cons:
      1. Requires knowledge of Bayesian inference and estimation which can be quite involved
      2. Assumes that at least one of the $m$ models is the true data generating process, which is a strong assumption.
      3. Requires the researcher to specify priors for the parameters in each forecasting model in addition to a discrete prior over all $m$ models
    • References
      1. Hoeting, Jennifer A., et al. "Bayesian model averaging." In Proceedings of the AAAI Workshop on Integrating Multiple Learned Models. 1998.
      2. Eklund, Jana, and Sune Karlsson. "Forecast combination and model averaging using predictive measures." Econometric Reviews 26.2-4 (2007): 329-363.
      3. Andersson, Michael K., and Sune Karlsson. "Bayesian forecast combination for VAR models." Bayesian Econometrics (2008): 501-524.
  4. Optimal Prediction Pools: Same idea as Bayesian forecast combination, except that the weights are found by maximizing the following "score function" (WLOG assume $h=1$) $$ \max_{\mathbf{w}}\sum_{i=1}^{t}\ln\bigg[\sum_{m=1}^{M} w_m f_m(y_i;y_1,...,y_{i-1})\bigg] \quad{(1)}$$ $$s.t.\;\;\sum_{m=1}^{M} w_m=1\;\text{ and }\; w_m \geq 0\; \forall m $$ where $f_m$ is the predictive density/likelihood of model $m$, which can be calculated with either Bayesian or frequentist methodology (see the references for more information).
    • pros:
      1. Considers the entire forecast distribution when calculating weights, not just the point forecast
      2. Can be implemented using either frequentist or Bayesian techniques, and is usually simpler to estimate than traditional Bayesian forecast combination
      3. Unlike traditional Bayesian forecast combination it does not need to assume one of the $m$ models is the true data generating process
    • cons:
      1. Because equation (1) requires numerical optimization, the number of models you can include is limited by the amount of data you have available. Further, if some models produce highly correlated forecasts, equation (1) may be very challenging to optimize
    • References
      1. Geweke, John, and Gianni Amisano. "Optimal prediction pools." Journal of Econometrics 164.1 (2011): 130-141.
      2. Durham, Garland, and John Geweke. "Improving asset price prediction when all models are false." Journal of Financial Econometrics 12.2 (2014): 278-306.
  5. Various other point-estimate-based techniques: (1) an ordinary least squares estimate of the weights, obtained by regressing the actual realized values on the point forecasts ($y_{t+h}=\beta_0 +w_1\hat y_{t+h,1}+...+w_M\hat y_{t+h,M}+u_{t+h}$); (2) trimming approaches that drop the worst performing models from an equally weighted combination; (3) setting the weights equal to the percentage of times a forecast has the minimum MSFE; etc.
    • pros and cons: vary depending on technique
    • References
      1. Timmermann, A. (2006), “Forecast Combinations”, Handbook of Economic Forecasting, 1, 135–196.
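
As a quick numeric illustration, the inverse-MSFE weights in scheme 2 are just normalized reciprocals of each model's holdout MSFE. A minimal sketch (in Python for brevity, even though the OP works in R; the MSFE values and point forecasts are made-up placeholders, with $k=1$):

```python
import numpy as np

# Placeholder holdout results: msfe[m] is model m's mean squared forecast
# error at horizon h, estimated on a validation window.
msfe = np.array([1.2, 0.8, 2.5])               # three candidate models
point_forecasts = np.array([10.0, 11.0, 9.5])  # each model's h-step forecast

k = 1                                          # k = 1 in most applications
inv = (1.0 / msfe) ** k
weights = inv / inv.sum()    # w_m = (1/msfe_m)^k / sum_j (1/msfe_j)^k

combined = weights @ point_forecasts           # combined h-step-ahead forecast
print(weights, combined)
```

By construction the weights sum to one, and the model with the smallest MSFE receives the largest weight.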

Are Any of the Forecasting Models Redundant/Excludable

The Holt-Winters models are likely to be similar, so maybe throw a couple of those out. Averaging forecasts is like diversifying a financial portfolio: you want your models to be diverse. With some of the above averaging techniques it doesn't hurt to include redundant models; with others it does.
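
To see why redundancy matters for the regression-based weights in technique 5, here is a toy sketch (simulated target, hypothetical forecasts): OLS identifies the individual weights from the differences between the forecasts, so near-duplicate models make the design matrix close to singular and the weights unstable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
y = np.cumsum(rng.normal(size=n))        # toy target series (placeholder)

# Two hypothetical point forecasts: noisy versions of the target,
# with forecast 1 less noisy than forecast 2.
f1 = y + rng.normal(0.0, 1.0, size=n)
f2 = y + rng.normal(0.0, 2.0, size=n)

# OLS weights: regress realized values on the forecasts with an
# intercept, y_t = b0 + w1*f1_t + w2*f2_t + u_t.
X = np.column_stack([np.ones(n), f1, f2])
b0, w1, w2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(w1, w2)
```

In this toy setup the less noisy forecast should receive the larger weight; if f1 and f2 were nearly identical, the individual weights would be poorly determined even though their sum is.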

You can also find a friendly introduction here, with a couple more good ways to average forecasts (constrained least squares, for example) along with an R implementation.
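
For completeness, the optimal-pool objective in equation (1) can be sketched for two models with a grid search over the single free weight. Everything here is a toy assumption: Gaussian predictive densities and simulated data; a real application would plug in each model's actual predictive likelihoods and use a proper numerical optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=200)   # toy realized series (placeholder)

# One-step-ahead predictive densities from two hypothetical models:
# model 1 is well calibrated, model 2 has the wrong scale.
def f1(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)              # N(0, 1)

def f2(x):
    return np.exp(-0.5 * (x / 3)**2) / (3 * np.sqrt(2 * np.pi))  # N(0, 9)

# Maximize the log score sum_i log(w*f1(y_i) + (1-w)*f2(y_i)) over
# w in [0, 1] by grid search -- equation (1) with M = 2.
grid = np.linspace(0.0, 1.0, 1001)
scores = [np.sum(np.log(w * f1(y) + (1 - w) * f2(y))) for w in grid]
w_star = grid[int(np.argmax(scores))]
print(w_star)
```

Since the data are drawn from the first model's density, the bulk of the pool weight should land on the well-calibrated model.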