First note: you really should also be dividing by the standard deviation of each feature (pixel) value. Subtracting the mean centers the input at 0, and dividing by the standard deviation turns each scaled feature value into the number of standard deviations it lies away from the mean.
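As a minimal sketch of what that looks like in numpy (the array name and shapes here are purely illustrative):

```python
import numpy as np

# Hypothetical design matrix: rows are training examples, columns are features (e.g. pixels).
X = np.random.rand(1000, 784) * 255.0

mean = X.mean(axis=0)        # per-feature mean over the training set
std = X.std(axis=0) + 1e-8   # small epsilon guards against zero-variance features

X_standardized = (X - mean) / std  # each value is now "standard deviations from the mean"
```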
To answer your question: consider how a neural network learns its weights. (C)NNs learn by continually adding gradient error vectors (multiplied by a learning rate), computed from backpropagation, to the various weight matrices throughout the network as training examples are passed through.
The thing to notice here is the "multiplied by a learning rate".
If we didn't scale our input training vectors, the ranges of the feature value distributions would likely differ from feature to feature, and thus the learning rate would cause corrections in each dimension that differ (proportionally speaking) from one another. We might be overcompensating a correction in one weight dimension while undercompensating in another.
This is non-ideal, as we might find ourselves in an oscillating state (unable to settle into a better minimum in cost(weights) space) or a slow-moving state (traveling too slowly to reach a better minimum).
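To make the scale argument concrete, here is a toy sketch (all numbers invented) of a single linear neuron with two features on wildly different scales; one global learning rate then moves the two weights by wildly different amounts:

```python
import numpy as np

x = np.array([2000.0, 3.0])   # e.g. house area in square feet vs. number of rooms
y = 500.0                     # target value
w = np.array([0.1, 0.1])      # initial weights
lr = 1e-4                     # single global learning rate

pred = w @ x           # forward pass
grad = (pred - y) * x  # gradient of the squared error (up to a factor of 2) w.r.t. w
w -= lr * grad         # the update to w[0] is ~667x larger than the update to w[1]
print(grad, w)
```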
It is of course possible to have a per-weight learning rate, but that introduces yet more hyperparameters into an already complicated network, which we'd also have to tune. Generally, learning rates are scalars.
Thus we try to normalize images before using them as input to a NN (or any gradient-based) algorithm.
Not all models are sensitive to data normalization. For example, models with batch-norm layers have a built-in mechanism that keeps the activation distributions in check. Others are more sensitive and may even diverge simply for lack of normalization (e.g., try to train a CNN on the CIFAR-10 dataset using raw training images whose pixels are in the range $[0, 255]$).
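A minimal sketch of normalizing CIFAR-10-style images, assuming `train_images` stands in for however you load the data (the statistics are computed from the training set, not hard-coded):

```python
import numpy as np

# Stand-in for a CIFAR-10-style training set: N images of 32x32x3 uint8 pixels in [0, 255]
# (CIFAR-10 itself has 50,000 training images).
train_images = np.random.randint(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)

x = train_images.astype(np.float32) / 255.0   # bring pixels into [0, 1]
mean = x.mean(axis=(0, 1, 2))                 # per-channel mean over the training set
std = x.std(axis=(0, 1, 2))                   # per-channel standard deviation
x_normalized = (x - mean) / std               # roughly zero-mean, unit-variance per channel
```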
But I'm not aware of any model that would suffer from data normalization. So even though the house prediction model (btw, which one exactly?) may not do it, the model is likely to improve if the data is normalized, and you should do it too.
GPS data has well-defined bounds: the latitude is in $[-90, 90]$ and the longitude is in $[-180, 180]$, so it's safe to round these up to $[-100, 100]$ and $[-200, 200]$. The coordinates for any populated area are much narrower still, but it's not a big deal to assume these wide ranges. This means that the transformation...
$$ x \mapsto \frac{x}{100}$$
... will ensure that the latitude is in $[-1, 1]$ and the longitude is in $[-2, 2]$ (and very likely in $[-1, 1]$ as well), which are perfectly reasonable ranges for deep learning. The transformation is simple (in numpy it takes just one line of code) and doesn't require you to compute any statistics from the training data.
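For example, assuming `coords` is an `(N, 2)` array of `[latitude, longitude]` pairs (the sample coordinates below are made up):

```python
import numpy as np

coords = np.array([[48.86, 2.35],       # hypothetical point near Paris
                   [-33.87, 151.21]])   # hypothetical point near Sydney

scaled = coords / 100.0   # latitude now in [-1, 1], longitude in [-2, 2]
```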
Best Answer
Subtracting the dataset mean serves to "center" the data. Additionally, you would ideally like to divide by the standard deviation of that feature or pixel as well, if you want to normalize each feature value to a z-score.
The reason we do both of those things is that in the process of training our network, we're going to be multiplying these initial inputs by weights and adding biases to them in order to produce activations, which we then backpropagate with the gradients to train the model.
We'd like in this process for each feature to have a similar range so that our gradients don't go out of control (and that we only need one global learning rate multiplier).
Another way you can think about it is that deep learning networks traditionally share many parameters: if you didn't scale your inputs in a way that resulted in similarly-ranged feature values (i.e., over the whole dataset, by subtracting the mean), sharing wouldn't happen very easily, because to one part of the image a weight $w$ would be large while to another it would be too small. You will see in some CNN models that per-image whitening is used, which is more along the lines of your thinking.
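A rough sketch of per-image whitening in the simple sense of per-image standardization (each image is normalized with its own mean and standard deviation, independently of the rest of the dataset; the function name and the std floor are just illustrative choices):

```python
import numpy as np

def per_image_whiten(image):
    """Standardize a single image to zero mean and roughly unit variance using only its own statistics."""
    image = image.astype(np.float32)
    mean = image.mean()
    # Floor the std so a nearly-constant image doesn't blow up the division.
    std = max(image.std(), 1.0 / np.sqrt(image.size))
    return (image - mean) / std

# Hypothetical single 32x32 RGB image with uint8 pixels.
img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
whitened = per_image_whiten(img)
```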