Remove outliers, then standardise. This way all the batches of your "good data" will be scaled consistently.
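As a rough sketch of that order of operations (assuming your data sits in a NumPy array; the helper name and the 3-sigma cutoff are illustrative choices, not recommendations):

```python
import numpy as np

def remove_outliers_then_standardise(X, z_cut=3.0):
    # Flag rows where any feature's z-score exceeds the cutoff
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    keep = (np.abs((X - mu) / sigma) < z_cut).all(axis=1)
    X_good = X[keep]
    # Re-estimate mean/SD on the remaining "good data" so every batch
    # is scaled consistently
    return (X_good - X_good.mean(axis=0)) / X_good.std(axis=0)

X = np.random.randn(1000, 5)
X[0] += 100.0                       # plant an obvious outlier
X_scaled = remove_outliers_then_standardise(X)
print(X_scaled.mean(axis=0))        # roughly 0 per feature
print(X_scaled.std(axis=0))         # roughly 1 per feature
```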
Downsampling, in the sense of removing data points, seems rather sketchy. If it is really required, you could do stratified sampling instead.
@Automated pipelines
If you have too many outliers, removing them discards an important chunk of the dataset. Or those outliers turn out to be genuinely important, and suddenly the predictions are bad.
I find using automated pre-processing of features about as feasible as flying planes without a pilot. You can set up the pipeline, but if the output really matters, you will always have to check yourself.
This trend of normalizing the data has some nice properties and a long history, dating back to the original conv nets and running through the modern conv nets that Google used in the ImageNet competitions.
Very briefly, Gradient-Based Learning Applied to Document Recognition, one of the landmark papers in deep learning and conv nets, showed that ZCA whitening speeds up training. ZCA whitening decorrelates the input features: the data are zero-centered and transformed so that the feature covariance matrix is roughly diagonal (though not necessarily the identity matrix, due to numerical precision issues).
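For illustration, a minimal NumPy sketch of ZCA whitening (not the paper's exact preprocessing; the eps term is an assumption added for numerical stability):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    # Zero-center, then rotate/rescale so features are decorrelated
    X = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)                  # feature covariance
    U, S, _ = np.linalg.svd(cov)                   # eigendecomposition (cov is symmetric)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T  # ZCA transform
    return X @ W

X = np.random.randn(500, 10) @ np.random.randn(10, 10)  # correlated features
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))  # approximately the identity
```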
More recently, Google published the article Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, which zero-centers and normalizes the standard deviations of the intermediate activations (before they are fed into the nonlinearity), though it does not necessarily decorrelate those activations. They obtained a pretty remarkable speedup in training, and the technique has since been used in other cutting-edge conv nets.
Either way, this notion of normalizing (the inputs, in the LeCun et al. article, and later even the intermediate network activations, in the Ioffe and Szegedy article) seems important for speeding up training.
But those methods were tested on image data. Apparently your data is not image data.
In my opinion these are merely rules of thumb, not laws of the land, and they have primarily been tested on specific types of datasets. If you feel your dataset is a bit non-standard, just try various methods, including multi-layer perceptrons with ReLU activations, and see what works. However, I'm not sure conv nets are ideal for non-image data: the convolutional layer is inspired by biological vision systems and encodes a fairly strong prior (at least in my opinion).
If you are indeed worried about your intermediate activation values being too large, you can try batch normalization. It standardizes the activation outputs (before they are fed into a nonlinearity); its whole purpose is to avoid exactly the problem you have described.
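As a rough sketch of what a batch normalization layer computes at training time (forward pass only, omitting the running statistics used at test time; the learnable gamma and beta parameters follow the Ioffe and Szegedy paper):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Standardize each feature over the mini-batch...
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    # ...then apply the learned scale and shift before the nonlinearity
    return gamma * x_hat + beta

batch = np.random.randn(64, 128) * 50 + 10   # deliberately large pre-activations
out = batch_norm_forward(batch, gamma=np.ones(128), beta=np.zeros(128))
print(out.mean(), out.std())                 # roughly 0 and 1
```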
f0k, one of the main Lasagne developers, has coded up a batch normalization layer: Batch Normalization for Lasagne. I'm not sure if you use that package, but it might be worth looking into.
Finally, note that batch normalization can be used in any sort of deep network: it has primarily been used to train deep conv nets faster, but it works in multi-layer perceptrons as well.
Best Answer
First off, standardization is usually taken to be

(value - mean) / standard deviation

The result has mean 0 and standard deviation 1.
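In code, a minimal sketch (the sample values are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = (x - x.mean()) / x.std()   # subtract the mean, divide by the SD
print(z.mean(), z.std())       # 0.0 and 1.0 (up to floating point)
```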
Dividing by the variance will be wrong for any variable that is not a pure number. One of the reasons for standardization is to remove any influence of the units of measurement. The standard deviation always has the same units of measurement as the variable itself and division washes out those units.
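As a toy illustration of units washing out: the same heights expressed in metres and in inches yield identical z-scores, whereas dividing by the variance does not:

```python
import numpy as np

metres = np.array([1.60, 1.70, 1.80, 1.90])
inches = metres / 0.0254                      # same heights, different units

z_m = (metres - metres.mean()) / metres.std()
z_in = (inches - inches.mean()) / inches.std()
print(np.allclose(z_m, z_in))                 # True: the units cancel

v_m = (metres - metres.mean()) / metres.var()
v_in = (inches - inches.mean()) / inches.var()
print(np.allclose(v_m, v_in))                 # False: variance keeps units
```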
There is no reason in principle why, e.g., subtraction of the median and division by the interquartile range, or in general any scaling
(value - measure of level) / measure of scale
might not be useful, but using mean and SD is by far the most common procedure. The idea that the Gaussian or normal is a reference distribution often underlies this, but using measures of level and scale other than the mean and standard deviation would often be useful, especially if you were interested in simple methods for identifying outliers (a very big topic covered by many threads in this forum).
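A minimal sketch of the median/IQR variant, sometimes called robust scaling (the percentile-based IQR here is one common choice among several):

```python
import numpy as np

def robust_scale(x):
    # (value - measure of level) / measure of scale, using the median
    # and interquartile range, which are less sensitive to outliers
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return (x - med) / (q3 - q1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])  # one gross outlier
print(robust_scale(x))   # the outlier barely distorts the others' scale
```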
The answer to your general question is pretty much tautologous: standardization is useful whenever differences in level, scale, or units of measurement would obscure what you want to see. If you are interested in relative variations, standardize first.
If you wanted to compare the heights of men and women, the units of measurement should be the same (metres or inches, whatever), and standardization is not required. But if the scientific or practical question requires comparing values relative to the mean, subtract the mean first. If it requires adjusting for different amounts of variability, divide by the standard deviation too.
Freedman, D., Pisani, R., and Purves, R. Statistics. New York: W.W. Norton (any edition) is good on this topic.