Solved – Machine-Learning algorithms for Forecasting

forecasting, machine learning, r, time series

For work, I'm building an app that forecasts the failure rate of a machine from factors such as the historical failure rates of the components used to build it, the failure rates of the factories that manufacture it, and the historical failure rate of the machine itself. The idea is to produce a solid prediction for any machine, so I need an algorithm that can automatically build a good model for each of thousands of machines.

I've been able to implement this using ARIMAX models, but I just don't feel good about running auto.arima() and then cross-validating to decide how many external regressors to add in. I've also tried SVMs, but the model seemed unable to drop irrelevant factors, so the prediction came out as a flat line. For reference, my current ARIMAX approach looks roughly like the sketch below.
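
A minimal sketch of what I'm doing now, with simulated stand-in data (the regressor names are made up):

```r
library(forecast)

set.seed(1)
# Simulated stand-ins for my real data: a failure-rate series plus
# two candidate external regressors (names are hypothetical)
n    <- 120
xreg <- cbind(factory_rate = rnorm(n), component_rate = rnorm(n))
y    <- ts(0.5 * xreg[, 1] + arima.sim(list(ar = 0.6), n = n), frequency = 12)

# Let auto.arima() pick the ARMA orders, with the regressors included
fit <- auto.arima(y, xreg = xreg)

# Forecast 12 steps ahead, given assumed future regressor values
future_xreg <- cbind(factory_rate = rnorm(12), component_rate = rnorm(12))
fc <- forecast(fit, xreg = future_xreg)
```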

I feel like boosting would be a promising direction, but I was wondering if anyone had other options and, more importantly, could point me to examples of how the specific algorithm was implemented in R. I'm an undergrad intern majoring in statistics, so I'm not strong on the programming side and have trouble turning the theory I read about into R code.

Also, would a normal GLM be good enough? I used ARIMAX because I wanted to correct for autocorrelation.

Best Answer

You already use two different forecasting methods, which is very good. In my experience, fiddling around with ever more different methods will rarely yield dramatic changes in forecasting accuracy.

Usually, it is far more important to ensure that your data are cleaned well. You mention large numbers of time series, so you will have to do your cleaning automatically. I'd recommend spending a significant amount of time understanding the kinds of data issues (outliers, missing data, ...) you see in your data and how to address them automatically, independently of any forecasting method.
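
For instance, the forecast package's tsclean() will replace outliers and interpolate missing values in one call. A toy sketch (the series here is simulated):

```r
library(forecast)

set.seed(1)
# Toy monthly series with one spike outlier and one missing value
y <- ts(c(rnorm(50), 25, rnorm(20), NA, rnorm(28)), frequency = 12)

# tsclean() replaces outliers and interpolates missing values
y_clean <- tsclean(y)

# Positions the cleaning step altered - worth logging at scale
which(is.na(y) | y != y_clean)
```

This is only a starting point: at your scale, I'd log whatever the cleaning step changes so you can audit it rather than trusting it blindly.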

In addition, I'd recommend seeing what you can learn from the two forecasting methods you are using. How many external factors does ARIMAX typically include? Maybe your SVM is correct about dropping most of them: is the SVM better or worse than ARIMAX on a holdout sample? Conversely, what AR, I and MA orders does auto.arima() typically choose? If you constrain all of these to be zero (essentially removing all autocorrelation from your model), how much worse is accuracy on a holdout sample? Maybe there really isn't all that much autocorrelation to worry about, in which case you might really just go with an ordinary GLM. Or just go ahead and fit a GLM and see how it performs on a holdout sample. A sketch of that comparison follows below.
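
A minimal sketch of that holdout comparison, with simulated stand-in data for one machine: auto.arima() with regressors versus the same regression with all AR, I and MA orders forced to zero.

```r
library(forecast)

set.seed(1)
# Stand-ins for one machine's failure-rate series and regressors
n    <- 120
h    <- 12
xreg <- cbind(factory_rate = rnorm(n), component_rate = rnorm(n))
y    <- ts(0.5 * xreg[, 1] + arima.sim(list(ar = 0.6), n = n), frequency = 12)

# Hold out the last h observations
train  <- window(y, end = time(y)[n - h])
test   <- window(y, start = time(y)[n - h + 1])
xtrain <- xreg[1:(n - h), , drop = FALSE]
xtest  <- xreg[(n - h + 1):n, , drop = FALSE]

# Full ARIMAX with orders chosen automatically
fit_arimax <- auto.arima(train, xreg = xtrain)

# Same regressors, but AR = I = MA = 0: a plain regression, no autocorrelation
fit_reg <- Arima(train, order = c(0, 0, 0), xreg = xtrain)

# Judge on the holdout sample, not on in-sample fit
accuracy(forecast(fit_arimax, xreg = xtest), test)["Test set", "RMSE"]
accuracy(forecast(fit_reg,    xreg = xtest), test)["Test set", "RMSE"]
```

If the zero-order model is barely worse, autocorrelation isn't buying you much and a GLM becomes a very reasonable candidate.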

(You may have noticed that I harp on holdout samples. Don't make the error of assessing in-sample accuracy.)

I like to recommend this open online forecasting textbook, which also covers the rudiments of vector autoregression and neural network forecasting (with R code!), in case you want to check these out. The whole book is very much worth reading. I don't know of anyone using boosting in forecasting, so while I certainly don't see anything wrong with it, I'd recommend that you first pick the low-hanging fruit I describe above.
