Is the inference time directly proportional to both the number of operations in a network and the number of parameters? Or is it directly proportional to the number of operations and only indirectly related to the number of parameters?
Solved – Inference Time in Neural Networks
conv-neural-network, machine-learning, neural-networks
Related Solutions
I have never worked with recurrent networks, but from what I know, in practice both RNNs and TDNNs can be used for the purpose you describe: predicting time series values. However, they work differently.
It is possible with a TDNN to:
- Predict a process's values
- Find a relationship between two processes.
Some RNNs, like NARX, also allow you to do that; NARX is also used to predict financial time series, usually better than a TDNN.
A TDNN looks more like a feedforward network, because the time aspect enters only through its inputs, unlike NARX, which also needs the predicted/real past output values as inputs. This characteristic makes a TDNN less robust than NARX for predicting values, but it requires less processing and is easier to train.
If you are trying to find a relationship between a process $X(t)$ and a process $Y(t)$, NARX requires you to have past values of $Y$, while TDNN does not.
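To make the difference in inputs concrete, here is a minimal sketch (hypothetical helper functions, not part of the original answer): a TDNN-style predictor only sees a window of past $X$ values, while a NARX-style predictor additionally sees past $Y$ values.

```python
import numpy as np

def tdnn_inputs(x, window=3):
    """Inputs for a TDNN-style predictor: a sliding window of past x values only."""
    return np.array([x[t - window:t] for t in range(window, len(x))])

def narx_inputs(x, y, window=3):
    """Inputs for a NARX-style predictor: past x values plus past y values."""
    return np.array([np.concatenate([x[t - window:t], y[t - window:t]])
                     for t in range(window, len(x))])

x = np.random.randn(100)              # exogenous process X(t)
y = np.cumsum(np.random.randn(100))   # target process Y(t)

print(tdnn_inputs(x).shape)      # (97, 3) -> lagged X only
print(narx_inputs(x, y).shape)   # (97, 6) -> lagged X and lagged Y
```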
I recommend reading Simon Haykin's Neural Networks: A Comprehensive Foundation (2nd Edition) and this FAQ. There are lots of neural network architectures and variations. Sometimes they have many names, or there is no consensus about their classification.
As a disclaimer, I work on neural nets in my research, but I generally use relatively small, shallow neural nets rather than the really deep networks at the cutting edge of research you cite in your question. I am not an expert on the quirks and peculiarities of very deep networks and I will defer to someone who is.
First, in principle, there is no reason you need deep neural nets at all. A sufficiently wide neural network with just a single hidden layer can approximate any (reasonable) function given enough training data. There are, however, a few difficulties with using an extremely wide, shallow network. The main issue is that these very wide, shallow networks are very good at memorization, but not so good at generalization. So, if you train the network with every possible input value, a super wide network could eventually memorize the corresponding output value that you want. But that's not useful because for any practical application you won't have every possible input value to train with.
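As a minimal sketch of what "a single hidden layer" means here (illustrative NumPy code with arbitrary sizes, not from the original answer): the whole model is two affine maps with one nonlinearity in between, and making it "wider" just means adding more hidden units.

```python
import numpy as np

def shallow_net(x, W1, b1, W2, b2):
    """Single-hidden-layer network: the 'width' is the number of rows of W1."""
    h = np.tanh(W1 @ x + b1)   # the one hidden layer (the only nonlinearity)
    return W2 @ h + b2         # linear readout

# a very wide shallow net: 10 inputs, 10,000 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((10_000, 10)), np.zeros(10_000)
W2, b2 = rng.standard_normal((1, 10_000)), np.zeros(1)
print(shallow_net(rng.standard_normal(10), W1, b1, W2, b2))  # a single output value
```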
The advantage of multiple layers is that they can learn features at various levels of abstraction. For example, if you train a deep convolutional neural network to classify images, you will find that the first layer will train itself to recognize very basic things like edges, the next layer will train itself to recognize collections of edges such as shapes, the next layer will train itself to recognize collections of shapes like eyes or noses, and the next layer will learn even higher-order features like faces. Multiple layers are much better at generalizing because they learn all the intermediate features between the raw data and the high-level classification.
So that explains why you might use a deep network rather than a very wide but shallow network. But why not a very deep, very wide network? I think the answer there is that you want your network to be as small as possible to produce good results. As you increase the size of the network, you're really just introducing more parameters that your network needs to learn, and hence increasing the chances of overfitting. If you build a very wide, very deep network, you run the chance of each layer just memorizing what you want the output to be, and you end up with a neural network that fails to generalize to new data.
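As a back-of-the-envelope illustration of how width inflates the parameter count (the layer sizes below are made up for the example, not taken from the answer):

```python
def dense_params(sizes):
    """Number of weights + biases in a fully connected net with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# both nets map 784 inputs to 10 outputs
wide_shallow = dense_params([784, 20_000, 10])         # one huge hidden layer
deep_narrow  = dense_params([784, 256, 256, 256, 10])  # several modest layers

print(wide_shallow)  # 15,900,010 parameters (~15.9 million)
print(deep_narrow)   # 335,114 parameters (~0.3 million)
```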
Aside from the specter of overfitting, the wider your network, the longer it will take to train. Deep networks already can be very computationally expensive to train, so there's a strong incentive to make them wide enough that they work well, but no wider.
Best Answer
There's no reason time should be proportional to the number of parameters. For example, you could imagine a fully connected one-layer network which computes $y = \sigma(w^T x)$ and a simple RNN which computes $y_0 = 0,\; y_i = \sigma(y_{i-1} + w x_i)$. Both involve roughly $n$ multiplications/additions, but the RNN has only a single parameter, while the fully connected network has $n$.
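A small sketch of that comparison (illustrative NumPy code for the same two toy models): both do roughly $n$ multiply-adds on an input of length $n$, but they store very different numbers of parameters.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 1_000
x = np.random.randn(n)

# fully connected layer: y = sigma(w^T x) -> n parameters, ~n multiply-adds
w = np.random.randn(n)
y_fc = sigma(w @ x)

# simple RNN: y_0 = 0, y_i = sigma(y_{i-1} + w * x_i) -> 1 parameter, ~n multiply-adds
w_rnn = np.random.randn()
y = 0.0
for x_i in x:
    y = sigma(y + w_rnn * x_i)

print("fully connected:", w.size, "parameters,", n, "multiply-adds")
print("RNN:            ", 1, "parameter, ", n, "multiply-adds")
```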
As for whether time is proportional to number of operations, that depends on what you mean by "time". Sometimes when computer scientists talk about time, we're interested in some theoretical measure of how long it takes to compute something, and this measure is usually defined as "number of operations". So it's tautological that the run time is proportional to the number of operations.
On the other hand, if you care about runtime in real life, then there's no straightforward relationship between the number of operations and wall-clock time. The fully connected network described above can be parallelized very easily, and the dot product can effectively be done in a single "cycle", whereas the RNN output must be computed sequentially, taking $n$ cycles.
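A rough way to see this on real hardware (an illustrative sketch; actual timings depend on your machine and libraries): the dot product is a single vectorized call, while the RNN is an $n$-step loop with a sequential dependency.

```python
import timeit
import numpy as np

n = 100_000
x = np.random.randn(n)
w = np.random.randn(n)
w_rnn = 0.5

def fully_connected():
    return np.tanh(w @ x)            # one parallelizable dot product

def rnn():
    y = 0.0
    for x_i in x:                    # each step needs the previous step's output
        y = np.tanh(y + w_rnn * x_i)
    return y

# roughly the same number of multiply-adds, very different wall-clock time
print("fully connected:", timeit.timeit(fully_connected, number=10), "s")
print("RNN:            ", timeit.timeit(rnn, number=10), "s")
```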