There are many techniques to optimize your learning, such as momentum, RPROP, and others. For more information, see Yann LeCun's paper "Efficient BackProp". There are also many heuristic methods for learning: for example, you can draw the initial weights from the "standard normal" distribution, or use a different learning rate for each layer or even for each synapse.
You can also find techniques that fit the network weights without iterative learning (though this is not gradient descent). These require computing a matrix inverse or pseudo-inverse. XOR is not a linearly separable problem, so you can't solve it the classic linear-algebra way; you must transform your data into another space somehow, for example with Radial Basis Functions (RBF).
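As a sketch of the pseudo-inverse idea (my own illustration, with Gaussian RBFs centered on the four training points and an arbitrary width parameter `gamma`):

```python
import numpy as np

# XOR inputs and targets -- not linearly separable in the original space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def rbf_features(X, centers, gamma=1.0):
    # Gaussian RBF: exp(-gamma * squared distance to each center)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

Phi = rbf_features(X, X)        # 4x4 design matrix in the RBF space
w = np.linalg.pinv(Phi) @ y     # weights in closed form, no training iterations
pred = Phi @ w                  # matches the XOR targets up to numerical precision
```

The RBF map makes the problem linearly solvable, so one pseudo-inverse recovers the weights exactly.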
I think this question is a bit too grand in scope. I'll point out a few things I think will be helpful and then try to answer your more specific questions afterward.
Firstly, your problem is somewhat ill-defined -- the first questions should be qualitative ones, that is, what are you predicting, and what do you believe are informative predictors? It seems unlikely to me that any of those daily unfiltered predictors have any discernible relationship with the unfiltered daily response. Rather, I would expect that some filtering (perhaps a moving average of some sort) of the inputs and output might yield a relationship. As a concrete example, I would find it reasonable that the 10-day average albedo might be strongly correlated with the 10-day average flow. It all depends on how long it takes for melting snow (or precipitation, etc.) to reach the point of flow measurement. Unless this is something you've already figured out or addressed with some rigor, I would recommend going back and trying GAMs again after filtering your predictors and response. In the natural sciences, where there may be some a priori sense of which variable interactions are important, these methods can be pretty effective. And even if not, you may learn something useful that will guide your next steps.
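One concrete way to do that filtering (a plain trailing moving average; the 10-day window is just the example from above, not a recommendation):

```python
import numpy as np

def moving_average(x, window=10):
    # Trailing moving average; the first window-1 values are dropped
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

# Hypothetical daily series standing in for a predictor and the response
rng = np.random.default_rng(0)
albedo = rng.random(365)
flow = rng.random(365)

albedo_10d = moving_average(albedo, 10)
flow_10d = moving_average(flow, 10)
# Fit your GAM / network to these smoothed series rather than the raw daily values
```

You would tune the window length to the physical lag you suspect (how long melt takes to reach the gauge).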
That aside:
1) (Note: I'm assuming you are dealing strictly with sigmoidal hidden neurons) Start with one. Given the data as you describe it, it's highly unlikely that you will benefit from additional layers. Multiple layers are generally appropriate only for vast datasets in which complex variable interactions are expected. There are a ton of heuristics out there for selecting the number of neurons and layers, and almost all of them were derived from one specific problem and are therefore false. Nevertheless, I maintain that a network with a single hidden layer (or zero hidden layers, as per my earlier suggestion to revisit simpler modeling tools) is more than capable of capturing the predictor/response relationships in your data -- provided the data are preprocessed appropriately.
2) Based on my reading of the 'neuralnet' PDF (https://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf), it appears that the 'threshold' argument is really a relative tolerance on the loss improvement. There's really no way to answer this question without some trial and error. I would plot the value of the loss function against the number of iterations and look for a relationship. As the threshold value gets smaller, the value of the (training) loss at the end of training will tend to decrease with it. Note that this doesn't necessarily indicate a better model; see my comments below regarding over-fitting.
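To make the trial and error concrete, here is a toy sketch (my own, not neuralnet's actual implementation) of a stopping rule based on relative loss improvement, showing why a smaller threshold tends to mean more iterations and a lower final training loss:

```python
import numpy as np

def train(threshold=1e-2, lr=0.1, max_iter=10000):
    # Toy linear model fit by gradient descent; stop when the relative
    # improvement in the loss falls below `threshold`
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
    w = np.zeros(3)
    prev_loss = np.inf
    for i in range(max_iter):
        resid = X @ w - y
        loss = (resid ** 2).mean()
        if abs(prev_loss - loss) < threshold * max(prev_loss, 1e-12):
            return i, loss
        prev_loss = loss
        w -= lr * 2 * X.T @ resid / len(y)  # gradient step on the squared loss
    return max_iter, loss
```

Plotting `loss` against the iteration count for a few threshold values is exactly the kind of diagnostic I would look at; remember that a lower training loss is not automatically a better model.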
3) Not sure what you mean by "choice of algorithm" -- do you mean the algorithm used to compute gradients and updates? If so, the answer is a definitive "maybe."
4) As for the other parameters, there is no succinct answer to this. Generally some judiciously applied L2 weight regularization is a good bet. But it depends on your stopping criteria and model validation technique as well.
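As a minimal sketch of what L2 weight regularization does, in the simplest (linear) setting where the penalty has a closed form -- in a network you would instead add the same lam * ||w||^2 term to the loss before computing gradients:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Minimize ||Xw - y||^2 + lam * ||w||^2; larger lam shrinks the weights
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.ones(5) + rng.normal(scale=0.1, size=50)

w_small = ridge_fit(X, y, lam=0.01)   # weak penalty
w_big = ridge_fit(X, y, lam=100.0)    # strong penalty -> smaller weight norm
```

The value of `lam` is one of the "knobs" I mention below, so it should be chosen with your validation scheme, not on the training loss.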
5) Again, a definitive "maybe." As the subject matter expert I think this is something you will be able to answer for yourself with some subject-specific analysis; "data" people can't help you here.
Additional comments:
Stopping criteria and over-fitting: Are you using in-sample loss to determine when to stop training? I would highly discourage doing so, because a sufficiently "wide" (number of neurons) network may just be able to effectively memorize (over-fit) the training set and give you extremely small loss values, while being totally unable to generalize to new/unseen data. Even if you keep some percentage of the data as a held-out validation set, there is still a great possibility of the aforementioned outcome, because there are so many "knobs" (researcher degrees of freedom) that you may over-fit to the validation set.
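A sketch of the held-out alternative (the split fraction and "patience" count are arbitrary illustrations, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 25% of the data; monitor loss there, not on the training portion
idx = rng.permutation(200)
tr, va = idx[:150], idx[150:]

w = np.zeros(3)
best_val, patience, stall = np.inf, 10, 0
for step in range(5000):
    resid = X[tr] @ w - y[tr]
    w -= 0.05 * 2 * X[tr].T @ resid / len(tr)   # gradient step on training data
    val_loss = ((X[va] @ w - y[va]) ** 2).mean()
    if val_loss < best_val - 1e-9:
        best_val, stall = val_loss, 0
    else:
        stall += 1
    if stall >= patience:   # validation loss has stopped improving
        break
```

Even this guards only against over-fitting the training set; repeated peeking at the validation loss while tuning can still over-fit the validation set.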
Preprocessing: (Note: the following is in addition to the filtering I discussed above, and should be performed after the filtering) Due to the way neural networks are trained, you will want to center and scale all of your predictor and response data (this is especially true when using weight regularization, for reasons beyond the scope of your question and this answer). I recommend converting them to z-scores by subtracting their means and dividing by their standard deviations prior to training. Then when you go to make predictions, you apply (to the new data) the scaling/centering parameters determined during training. This is important because the initial weights are all drawn from the same random distribution, so if one predictor is on a vastly different scale than another, the initialization process may systematically over/underestimate the weights for one or more inputs. This can really screw up the training process. You should also look into the SD of the distribution from which the initial weight values are drawn; this can affect model performance as well as training time.
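The centering/scaling step might look like this (a sketch; the point is that the means and SDs come from the training data only and are reused at prediction time):

```python
import numpy as np

def fit_scaler(X_train):
    # Compute per-column centering/scaling parameters from TRAINING data only
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_scaler(X, mu, sd):
    # Reuse the training-time parameters on any data, including new data
    return (X - mu) / sd

rng = np.random.default_rng(0)
X_train = rng.random((100, 4)) * np.array([1, 10, 100, 1000])  # wildly different scales
mu, sd = fit_scaler(X_train)
Z = apply_scaler(X_train, mu, sd)   # each column now has mean ~0 and SD ~1
```

At prediction time you would call `apply_scaler(X_new, mu, sd)` with the stored `mu` and `sd`, never refit them on the new data.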
Best Answer
Jürgen Schmidhuber's "Deep Learning in Neural Networks: An Overview" traces the history of key concepts in neural networks and deep learning. In his view, neural networks would appear to encompass essentially any model which can be characterized as a directed graph where each node represents some computational unit. Schmidhuber is a prominent neural networks researcher who wrote the original paper on LSTM networks with Sepp Hochreiter.
On the other hand, I'm not sure that it's necessarily profitable to try to construct a taxonomy of mutually-exclusive buckets for machine learning strategies. I think we can say that there are perspectives from which models can be viewed as neural networks. I don't think that perspective is necessarily the best or most useful in all contexts. For example, I'm still planning to refer to random forests and gradient boosted trees as "tree ensembles" instead of abstracting away their distinctions and calling them "neural network trees". Moreover, Schmidhuber distinguishes NNs from kernel machines -- even though kernel machines have some connections to NNs -- when he writes "In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines ... in numerous important applications."