If you like, you could do this by writing a custom processing function and passing it to this gist I wrote:
https://gist.github.com/CharlieCodex/f494b27698157ec9a802bc231d8dcf31
import tensorflow as tf

def self_feeding_rnn(cell, seqlen, Hin, Xin, processing=tf.identity):
    '''Unroll `cell` by feeding its output (hidden state) back into it as input.
    Outputs are passed through `processing`. It is up to the caller to ensure that
    the processed outputs have a suitable shape to be fed back in as input.'''
    veclen = tf.shape(Xin)[-1]
    # the TensorArray grows from [BATCHSIZE, 0, VECLEN] to [BATCHSIZE, SEQLEN, VECLEN]
    buffer = tf.TensorArray(dtype=tf.float32, size=seqlen)
    initial_state = (0, Hin, Xin, buffer)
    condition = lambda i, *_: i < seqlen

    def do_time_step(i, state, xo, ta):
        Yt, Ht = cell(xo, state)   # run the cell on the previous (processed) output
        Yro = processing(Yt)       # post-process before feeding back in
        return (1 + i, Ht, Yro, ta.write(i, Yro))

    _, Hout, _, final_ta = tf.while_loop(condition, do_time_step, initial_state)
    ta_stack = final_ta.stack()
    Yo = tf.reshape(ta_stack, shape=(-1, seqlen, veclen))
    return Yo, Hout
If your code is something like:
# how your network might work:
W = tf.Variable(shape=(state_size, 3), ... )
B = tf.Variable(shape=(3,), ... )
Yo, Ho = tf.nn.dynamic_rnn(cell, input, initial_state=state)
# predictions are (lat, lon, temp) 3-vectors
predictions = tf.tensordot(Yo, W, axes=1) + B   # apply W at every time step
You could use the gist as:
# using self_feeding_rnn
from magic import temperature_sampler   # placeholder: some 2D sampler over (lat, lon)

def process_yt(yt):
    p = tf.matmul(yt, W) + B
    # look the real temperature up at the predicted (lat, lon)
    real_temp = temperature_sampler[p[..., 0], p[..., 1]]
    # remove the final element (predicted temp) and append the sampled temp
    return tf.concat((p[..., :-1], real_temp), axis=-1)

# seqlen: how many steps to generate; seed: the first input vector
Yo, Ho = self_feeding_rnn(cell, seqlen, Hin=initial_state, Xin=seed, processing=process_yt)
This makes the crux of your problem getting the temperature data into a format TensorFlow can understand (some sort of 2D sampler). I have no experience working with such things, but in the worst case you can just round your lat/lon to integers and look the temperature up in a constant array (using tf.constant, not np.ndarray, so that you can index with tensors).
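As a rough sketch of that worst-case fallback (the grid shape, index offsets, and random data here are my assumptions, not anything from your problem):

import numpy as np
import tensorflow as tf

# hypothetical 1-degree temperature grid indexed as [lat + 90, lon + 180]
temp_grid = tf.constant(np.random.rand(181, 361), dtype=tf.float32)

def lookup_temperature(lat, lon):
    # round to the nearest whole degree and shift into array-index space
    lat_idx = tf.cast(tf.round(lat), tf.int32) + 90
    lon_idx = tf.cast(tf.round(lon), tf.int32) + 180
    # gather_nd lets us index the constant grid with tensors
    return tf.gather_nd(temp_grid, tf.stack([lat_idx, lon_idx], axis=-1))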
If you are still working on this, I would love to help; feel free to ask me any questions!
This is a great question and there's actually been some research tackling the capacity/depth issues you mentioned.
There's been a lot of evidence that depth in convolutional neural networks leads to learning richer and more diverse feature hierarchies. Empirically, we see that the best-performing nets tend to be "deep": the Oxford VGG-Net had 19 layers, the Google Inception architecture is deep, the Microsoft Deep Residual Network has a reported 152 layers, and all of these obtain very impressive ImageNet benchmark results.
Higher-capacity models have a tendency to overfit unless you use some sort of regularizer. One way overfitting hurts very deep networks is that they rapidly reach very low training error within a small number of training epochs, which means we cannot train the network for many passes through the dataset. A technique like Dropout, a stochastic regularization technique, allows us to train very deep nets for longer periods of time. This in effect allows us to learn better features and improve classification accuracy, because we get more passes through the training data.
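As a minimal sketch of what Dropout looks like in practice (the layer sizes and names here are illustrative assumptions, not from any particular architecture):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
keep_prob = tf.placeholder(tf.float32)   # e.g. 0.5 while training, 1.0 at test time

# randomly zero out activations between layers, so the net can be trained for
# more epochs without memorizing the training set as quickly
hidden = tf.layers.dense(x, 512, activation=tf.nn.relu)
hidden = tf.nn.dropout(hidden, keep_prob=keep_prob)
logits = tf.layers.dense(hidden, 10)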
With regards to your first question:
Why can you not just reduce the number of layers / nodes per layer in a deep neural network, and make it work with a smaller amount of data?
If we reduce the training set size, how does that affect generalization performance? Using a smaller training set may result in learning a smaller distributed feature representation, and this may hurt our ability to generalize. Ultimately, we want to generalize well, and having a larger training set allows us to learn a more diverse distributed feature hierarchy.
With regards to your second question:
Is there a fundamental "minimum number of parameters" that a neural network requires until it "kicks in"? Below a certain number of layers, neural networks do not seem to perform as well as hand-coded features.
Now let's add some nuance to the discussion of depth above. Given the current state of the art, it appears that to train a high-performance conv net from scratch, some sort of deep architecture is needed.
But there's been a string of results focused on model compression. So this isn't a direct answer to your question, but it's related. Model compression is interested in the following question: given a high-performance model (in our case, let's say a deep conv net), can we compress the model, reducing its depth or even its parameter count, while retaining the same performance?
We can view the high performance, high capacity conv net as the teacher. Can we use the teacher to train a more compact student model?
Surprisingly, the answer is yes. There's been a series of results; a good article from the conv net perspective is the one by Rich Caruana and Jimmy Ba, Do Deep Nets Really Need to be Deep?. They are able to train a shallow model to mimic the deeper model with very little loss in performance. There's been some more work on this topic as well; I'm sure I'm missing some other good articles.
To me, these sorts of results call into question how much capacity these shallow models really have. In the Caruana and Ba article, they state the following possibility:
"The results suggest that the strength of deep learning may arise in part from a good match between deep architectures
and current training procedures, and that it may be possible to devise better learning algorithms to train more accurate shallow feed-forward nets. For a given number of parameters, depth may make learning easier, but may not always be essential"
It's important to be clear: in the Caruana and Ba article, they are not training a shallow model from scratch, i.e. from just the class labels, to obtain state-of-the-art performance. Rather, they train a high-performance deep model, and from this model they extract log probabilities for each datapoint. A shallow model is then trained to predict these log probabilities. So the shallow model is not trained on the class labels, but rather on the deep model's log probabilities.
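As a rough sketch of that mimic-training setup (the layer sizes, placeholder names, and loss choice here are my assumptions, not details from the paper):

import tensorflow as tf

num_features, num_classes = 784, 10   # illustrative sizes

x = tf.placeholder(tf.float32, [None, num_features])
# log probabilities extracted from the trained deep "teacher" model
teacher_logits = tf.placeholder(tf.float32, [None, num_classes])

# a single wide hidden layer stands in for the "shallow" student
hidden = tf.layers.dense(x, 1024, activation=tf.nn.relu)
student_logits = tf.layers.dense(hidden, num_classes)

# regress on the teacher's logits rather than training on the class labels
loss = tf.reduce_mean(tf.squared_difference(student_logits, teacher_logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)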
Nonetheless, it's still quite an interesting result. While this doesn't provide a direct answer to your question, there are some interesting ideas here that are very relevant.
Fundamentally: it's always important to remember that there is a difference between the theoretical "capacity" of a model and finding a good configuration of your model. The latter depends on your optimization methods.
Not all models are sensitive to data normalization. For example, models with a batch-norm layer have a built-in mechanism that normalizes the distribution of activations. Others are more sensitive and may even diverge simply for lack of normalization (e.g., try to train a CNN on the CIFAR-10 dataset with training images whose pixels are in the range $[0, 255]$).
But I'm not aware of any model that would suffer from data normalization. So even though the house prediction model (btw, which one exactly?) may not do it, the model is likely to improve if the data is normalized, and you should do it too.
GPS data has roughly these bounds: latitude is in $[-90, 90]$ and longitude is in $[-180, 180]$. The coordinates of populated areas span a much narrower range, but it's not a big deal to assume the slightly wider bounds $[-100, 100]$ and $[-200, 200]$. This means that the transformation...
$$ x \mapsto \frac{x}{100}$$
... will ensure that the latitude is in $[-1, 1]$ and the longitude is in $[-2, 2]$ (and very likely in $[-1, 1]$ as well), which are fairly robust ranges for deep learning. The transformation is easy (in numpy it takes just one line of code) and doesn't require you to compute any statistics from the training data.
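For example, assuming coords is an (N, 2) numpy array of [latitude, longitude] pairs (a layout I'm assuming here for illustration), the whole transformation is:

import numpy as np

coords = np.array([[40.7128,  -74.0060],    # New York
                   [-33.8688, 151.2093]])   # Sydney

normalized = coords / 100.0   # latitude lands in [-1, 1], longitude in [-2, 2]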