Solved – Feature standardization for convolutional network on sparse data

data preprocessing, deep learning, standardization

I am preprocessing input data for a convolutional network (ConvNet) trained with SGD. For guidance, see this quote from Ilya Sutskever (A Brief Overview of Deep Learning):

It is essential to center the data so that its mean is zero and so that the variance of each of its dimensions is one. Sometimes, when the input dimension varies by orders of magnitude, it is better to take the log(1 + x) of that dimension. Basically, it’s important to find a faithful encoding of the input with zero mean and sensibly bounded dimensions. Doing so makes learning work much better.

This makes sense. The issue is that my dataset is highly sparse, which makes it difficult to obtain both unit variance and sensibly bounded dimensions. In Andrew Ng's Coursera course, Machine Learning, he states that a sensibly bounded interval is $ \left[ -3, 3 \right] $. To obtain unit variance I must divide by the standard deviation, which is roughly 0.25 for each channel; this puts around 30-50% of the values in each channel outside the "sensibly bounded interval". More importantly, the standardization produces some fairly large values in the interval $ \left[ 10, 30 \right] $, which I believe could push the weights far in the wrong direction.

My dataset consists of roughly 1.4 million observations of non-negative data, and it is trained using a ConvNet with 8 weight layers (convolutional and dense). I am not modeling images or text (the typical uses of ConvNets). As an illustration, imagine that 80% of my values are 0 and the rest are uniformly distributed on $ \left( 0, 10 \right] $.
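To make the issue concrete, here is a small numpy sketch (not my actual pipeline, just the illustrative distribution above): it standardizes such a sparse channel and measures how much of it falls outside $ \left[ -3, 3 \right] $, with and without the log(1 + x) transform from the quote. The numbers it prints will of course differ from my real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sparse channel: 80% zeros, the rest uniform on (0, 10].
n = 1_000_000
x = np.where(rng.random(n) < 0.8, 0.0, rng.uniform(0.0, 10.0, n))

# Plain standardization: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# How much of the standardized channel escapes the "sensible" interval [-3, 3]?
outside = np.mean(np.abs(z) > 3.0)
print(f"std before scaling: {x.std():.3f}")
print(f"fraction outside [-3, 3]: {outside:.3%}, max |z|: {np.abs(z).max():.2f}")

# Transformation from the Sutskever quote: log(1 + x) before standardizing,
# which compresses the large values.
zl = np.log1p(x)
zl = (zl - zl.mean()) / zl.std()
print(f"after log1p, max |z|: {np.abs(zl).max():.2f}")
```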

I have the following questions:

  1. Based on knowledge of SGD (and perhaps ConvNets), is it more important to aim for $ \sigma = 1 $ or for a sensible interval such as $ \left[ -3, 3 \right] $?

  2. What transformations could I apply to achieve $ \sigma = 1 $ while keeping my values in a sensible interval?

Best Answer

This practice of normalizing the data has a long history and some nice properties, dating back to the original conv nets and continuing in the modern conv nets that Google used in the ImageNet competitions.

Very briefly, Gradient-Based Learning Applied to Document Recognition, one of the landmark papers in deep learning and conv nets, showed that ZCA whitening, which decorrelates the input features (so you zero-center the data and the feature correlation matrix becomes roughly diagonal, though not necessarily the identity matrix due to numerical precision issues), speeds up training.
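For reference, ZCA whitening itself is only a few lines; here is a rough numpy sketch (my own, not the exact recipe from that paper): zero-center, eigendecompose the feature covariance, rescale in the eigenbasis, and rotate back.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten rows of X (samples x features): zero mean and roughly
    identity covariance, while staying close to the original coordinates."""
    Xc = X - X.mean(axis=0)                  # zero-center each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition
    # Rotate into the eigenbasis, rescale by 1/sqrt(eigenvalue), rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

# Toy usage: whiten a correlated design matrix and check the covariance.
X = np.random.randn(1000, 5) @ np.random.randn(5, 5)
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))  # approximately the identity
```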

More recently, Google published Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, which zero-centers and scales the intermediate activations to unit standard deviation (before they are fed into the nonlinearity), though without necessarily decorrelating them. They obtained a remarkable speedup in training, and the technique has since been used in other cutting-edge conv nets.
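To make that transform concrete, here is a small numpy sketch of the training-time batch-norm step for a dense layer (the running statistics used at test time and the backward pass are omitted); gamma and beta are the learned scale and shift from the paper.

```python
import numpy as np

def batch_norm_forward(a, gamma, beta, eps=1e-5):
    """Batch-normalize pre-activations a of shape (batch, features):
    normalize each feature over the mini-batch, then apply a learned
    scale (gamma) and shift (beta), as in Ioffe & Szegedy."""
    mu = a.mean(axis=0)                    # per-feature mini-batch mean
    var = a.var(axis=0)                    # per-feature mini-batch variance
    a_hat = (a - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * a_hat + beta            # network can undo this if useful

# Toy usage: pre-activations of a dense layer with 4 units, batch of 8.
a = np.random.randn(8, 4) * 10 + 3         # badly scaled activations
out = batch_norm_forward(a, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```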

Either way, this notion of normalizing (the inputs in the LeCun et al. article, and later even the intermediate network activations in the Ioffe and Szegedy article) appears to be important for speeding up training.

But those methods were tested on image data. Apparently your data is not image data.

In my opinion, these are rules of thumb rather than laws of the land, and they are rules of thumb that have been tested primarily on specific types of datasets. If you feel your dataset is a bit non-standard, try the various methods, try multi-layer perceptrons with ReLU activations as well, and see what works. I'm also not sure that conv nets are ideal for non-image data: the convolutional layer is inspired by biological vision systems and imposes a fairly "strong prior" (at least in my opinion).

If you are indeed worried about your intermediate activation values becoming too large, you can try batch normalization. It standardizes the activation outputs (before they are fed into the nonlinearity), and its whole purpose is to avoid exactly the problem you describe.

One of the main Lasagne developers, f0k, has coded up a batch normalization layer (Batch Normalization for Lasagne). I'm not sure whether you use this package, but it might be worth looking into.
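I haven't checked your setup, but if you do use Lasagne, usage looks roughly like this (assuming that layer is available as lasagne.layers.batch_norm in your version; the network shape here is made up, so check the Lasagne docs before copying):

```python
from lasagne.layers import InputLayer, DenseLayer, batch_norm
from lasagne.nonlinearities import rectify, softmax

# Hypothetical network: adapt the input shape and layer sizes to your data.
net = InputLayer(shape=(None, 50))
# batch_norm() wraps a layer so its pre-activations are batch-normalized
# before the nonlinearity, matching the recipe from the paper.
net = batch_norm(DenseLayer(net, num_units=256, nonlinearity=rectify))
net = batch_norm(DenseLayer(net, num_units=256, nonlinearity=rectify))
net = DenseLayer(net, num_units=10, nonlinearity=softmax)
```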

Finally, note that batch normalization can be used in any sort of deep network: it has primarily been used to train deep conv nets faster, but it works for multi-layer perceptrons as well.