Solved – Choosing the parameters for an artificial neural network for time-series regression in R

autocorrelation, machine learning, neural networks, r, regression

I'm trying to build an artificial neural network (ANN) using the R "neuralnet" package to predict streamflow from snow albedo (the reflectance of the snow, which controls the amount of heat it absorbs and therefore its melt rate), precipitation, air temperature, and a temporal variable, 'day of the year'. All of the above variables are daily time series with 4383 values each, spanning 2002 to 2013. Simpler statistical models have not worked because of the complex autocorrelated and lagged relationship between the predictors and the response.

I have the following questions about building the ANN:

  1. What number of hidden layers should I use when building the model? How does this choice affect the model?
  2. Should I change the 'threshold' parameter from the default (0.01)? My flow values fluctuate between 0.5 and 134.82 cubic feet per second (cfs), with a mean of 8.5 cfs. Ideally, my flow prediction errors should not be greater than 1-2 cfs.
  3. Will the choice of algorithm affect the prediction accuracy?
  4. Should I change any of the other parameters in neuralnet?
  5. Should I include any other variables (lags, temporal variables, etc.) in my model?

Also, would any other machine learning or statistical method be more suitable for this task? My data are highly non-linear, with seasonality every year, and PACF and CCF plots indicate significant lags at every period between -30 and +30 days.

I would be happy to answer any questions about the data, or about what I've already tried (GAMs, GLMs, Decision Trees and Random Forests).
Thank you.

Best Answer

I think this question is a bit too grand in scope. I'll point out a few things I think will be helpful and then try to answer your more specific questions afterward.

Firstly, your problem is somewhat ill-defined -- the first questions should be qualitative ones, that is, what are you predicting, and what do you believe are informative predictors? It seems unlikely to me that any of those daily unfiltered predictors have any discernible relationship with the unfiltered daily response. Rather, I would expect that some filtering (perhaps a moving average of some sort) of the inputs and output might yield a relationship. As a concrete example, I would find it reasonable that the 10-day average albedo might be strongly correlated with the 10-day average flow. It all depends on how long it takes for melting snow (or precipitation, etc.) to reach the point of flow measurement. Unless this is something you've already figured out or addressed with some rigor, I would recommend going back and trying GAMs again after filtering your predictors and response. In the natural sciences, where there may be some a priori sense of which variable interactions are important, these methods can be pretty effective. And even if not, you may learn something useful that will guide your next steps.
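A rough sketch of that kind of filtering in base R; the column names (albedo, precip, temp, flow), the data frame name df, and the 10-day window are placeholders to adapt to your own data:

    # Centered k-day moving average via base R's stats::filter
    ma <- function(x, k = 10) as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))

    df$albedo_ma <- ma(df$albedo)   # 'df' is assumed to hold your daily series
    df$precip_ma <- ma(df$precip)
    df$temp_ma   <- ma(df$temp)
    df$flow_ma   <- ma(df$flow)

    # Quick check of whether the smoothed series are more strongly related
    cor(df$albedo_ma, df$flow_ma, use = "complete.obs")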

That aside:

1) (Note: I'm assuming you are dealing strictly with sigmoidal hidden neurons.) Start with one. Given the data as you describe it, it's highly unlikely that you will benefit from additional layers. Multiple layers are generally appropriate only for vast datasets in which complex variable interactions are expected. There are a ton of heuristics out there for selecting the number of neurons and layers, and almost all of them were derived from one specific problem, so they don't generalize. Nevertheless, I maintain that a network with a single hidden layer (or zero hidden layers, as per my earlier suggestion to revisit simpler modeling tools) is more than capable of capturing the predictor/response relationships in your data -- provided the data are preprocessed appropriately.
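For concreteness, a minimal single-hidden-layer fit with 'neuralnet' might look like this; hidden = 5 is an arbitrary starting point, and the variable names are the same placeholders as above, not recommendations:

    library(neuralnet)

    set.seed(1)
    fit <- neuralnet(flow ~ albedo + precip + temp + doy,
                     data = train,          # your (filtered, scaled) training set
                     hidden = 5,            # one layer of 5 sigmoidal neurons
                     linear.output = TRUE)  # linear output unit for regression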

2) Based on my reading of the 'neuralnet' PDF (https://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf), the 'threshold' argument is a stopping criterion on the partial derivatives of the error function: training stops once they fall below it. There's really no way to answer this question without some trial and error. I would plot the value of the loss function against the number of iterations and look for a relationship. As the threshold gets smaller, the value of the (training) loss at the end of training will tend to decrease with it. Note that this doesn't necessarily indicate a better model; see my comments below regarding over-fitting.
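One way to run that experiment, reusing the placeholder model from above (the final training error and step count are stored in the fitted object's result.matrix):

    thresholds <- c(0.1, 0.05, 0.01, 0.005)
    results <- sapply(thresholds, function(th) {
      set.seed(1)
      fit <- neuralnet(flow ~ albedo + precip + temp + doy,
                       data = train, hidden = 5, threshold = th)
      fit$result.matrix[c("error", "steps"), 1]
    })
    colnames(results) <- thresholds
    results  # training error should shrink as the threshold does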

3) Not sure what you mean by "choice of algorithm" -- do you mean the algorithm used to compute gradients and updates? If so, the answer is a definitive "maybe."
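For reference, neuralnet's 'algorithm' argument accepts "backprop", "rprop+" (the default), "rprop-", "sag", and "slr"; plain backpropagation additionally requires a learning rate. A sketch, again with the placeholders from above:

    fit_bp <- neuralnet(flow ~ albedo + precip + temp + doy,
                        data = train, hidden = 5,
                        algorithm = "backprop",
                        learningrate = 0.01)  # required when algorithm = "backprop"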

4) As for the other parameters, there is no succinct answer to this. Generally some judiciously applied L2 weight regularization is a good bet. But it depends on your stopping criteria and model validation technique as well.
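Note that neuralnet itself does not expose a weight-decay argument; if you want L2 regularization, one option is the 'nnet' package, whose 'decay' parameter is exactly that penalty. A sketch (decay = 0.1 is an arbitrary starting value to tune by validation):

    library(nnet)

    set.seed(1)
    fit_l2 <- nnet(flow ~ albedo + precip + temp + doy,
                   data = train, size = 5,   # one hidden layer of 5 neurons
                   decay = 0.1,              # L2 penalty on the weights
                   linout = TRUE,            # linear output for regression
                   maxit = 500)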

5) Again, a definitive "maybe." As the subject-matter expert, I think this is something you will be able to answer for yourself with some subject-specific analysis; "data" people can't help you here.

Additional comments:

Stopping criteria and over-fitting: Are you using in-sample loss to decide when to stop training? I would strongly discourage that, because a sufficiently "wide" (many-neuron) network may simply memorize (over-fit) the training set and give you extremely small loss values while being totally unable to generalize to new/unseen data. Even if you keep some percentage of the data as a held-out validation set, there is still a real risk of the same outcome, because with so many "knobs" (researcher degrees of freedom) you may end up over-fitting to the validation set as well.
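For time series, the held-out set should also be chronological rather than random. A sketch, assuming the daily data frame 'df' and fitted model 'fit' from above:

    n_val <- 730                              # e.g. hold out the last two years
    train <- df[1:(nrow(df) - n_val), ]
    valid <- df[(nrow(df) - n_val + 1):nrow(df), ]

    # Out-of-sample predictions (newer neuralnet versions also offer predict())
    pred <- neuralnet::compute(fit, valid[, c("albedo", "precip", "temp", "doy")])$net.result
    sqrt(mean((valid$flow - pred)^2))         # validation RMSE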

Preprocessing: (Note: the following is in addition to the filtering I discussed above, and should be performed after the filtering.) Due to the way neural networks are trained, you will want to center and scale all of your predictor and response data (this is especially true when using weight regularization, for reasons beyond the scope of your question and this answer). I recommend converting them to z-scores by subtracting their means and dividing by their standard deviations prior to training. Then, when you go to make predictions, you apply to the new data the scaling/centering parameters determined on the training data. This matters because the initial connection weights are drawn from the same random distribution for every input, so if one predictor is on a vastly different scale than another, the initialization may systematically over- or under-weight one or more inputs, which can really derail training. You should also look into the standard deviation of the distribution from which the initial weights are drawn; this can affect model performance as well as training time.
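A sketch of that workflow, assuming raw (already filtered) training and validation data frames 'train_raw' and 'valid_raw' with numeric columns only:

    # Centering/scaling parameters come from the training set alone
    mu   <- colMeans(train_raw)
    sdev <- apply(train_raw, 2, sd)

    train <- as.data.frame(scale(train_raw, center = mu, scale = sdev))
    valid <- as.data.frame(scale(valid_raw, center = mu, scale = sdev))

    # Map z-score predictions back to flows in cfs
    back_to_cfs <- function(pred_z) pred_z * sdev["flow"] + mu["flow"]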
