Solved – Number of hidden units in Restricted Boltzmann Machine

machine-learning, restricted-boltzmann-machine

In section 12.1 of Geoff Hinton's A Practical Guide to Training Restricted Boltzmann Machines, on how to choose the number of hidden units, it is stated that one should

"estimate how many bits it would take to describe each data-vector if you were using a good model (i.e. estimate the typical negative log2 probability of a datavector under a good model)."

What exactly does this mean? How does one estimate a negative log2 probability of a data vector?

Best Answer

That text, if it's this one, seems to be alluding to information-theoretic entropy. By definition, entropy is the expectation of the negative log probability, in which case "find the expectation" is what he means by "estimate."
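Concretely, for a discrete random variable X (here, a data vector), the entropy in bits is the expected negative log2 probability:

$$
H(X) = \mathbb{E}\left[-\log_2 p(X)\right] = -\sum_x p(x)\,\log_2 p(x)
$$

So "the typical negative log2 probability of a data vector under a good model" is exactly this expectation, taken under the model's distribution.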

For some intuition on why this measure makes sense, consider this from the wiki introduction:

The entropy of the message is its amount of uncertainty; it increases when the message is closer to random, and decreases when it is less random. The idea here is that the less likely an event is, the more information it provides when it occurs.

In essence, Hinton seems to suggest that the more uncertain you are of the inputs, the more hidden units you should have. This gels well with his conclusion to the same paragraph:

If the training cases are highly redundant, as they typically will be for very big training sets, you need to use fewer parameters.

To the question of calculation, I don't see how one can do this without assuming some distribution for the vector. For instance, say your data are images of digits 0-9, and your input encodes pixels as a 256-length vector of binary values. If you assume every pixel is as likely to be a 0 as a 1, then this value will be 256 bits.
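A quick sketch of that arithmetic (assuming 256 independent, fair binary pixels; entropies of independent variables add):

```python
import math

# One fair binary pixel: p(0) = p(1) = 0.5, so its entropy is exactly 1 bit.
p = 0.5
pixel_entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # 1.0

# 256 independent pixels: per-pixel entropies sum to 256 bits per data vector.
vector_entropy = 256 * pixel_entropy
print(vector_entropy)  # 256.0
```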

But say you learn that the values nearer the corners are less likely, which is sensible given that most writers of western digits won't write near the corners. If you adjust those probabilities down and the others up, you'll find that entropy falls slightly. That is, your measure of uncertainty drops. This example is a rather ad hoc application of subjective domain knowledge, but you could also be more rigorous. For example, estimate the distribution from the training data.
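The empirical version of that estimate could look like the sketch below. The data set, the "corner pixel" probabilities, and the independence of pixels are all illustrative assumptions, not anything from Hinton's guide; with real images you would replace the synthetic `data` array with your training matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: 1000 binary "images" of length 256, where the
# first 32 pixels stand in for rarely-used corner pixels (p = 0.05 on).
probs_true = np.full(256, 0.5)
probs_true[:32] = 0.05
data = rng.random((1000, 256)) < probs_true

# Estimate each pixel's Bernoulli probability from the data, then sum the
# per-pixel entropies (this treats pixels as independent, which real image
# pixels are not -- so this is an upper-bound style estimate).
p_hat = data.mean(axis=0).clip(1e-9, 1 - 1e-9)
entropy_bits = -(p_hat * np.log2(p_hat) + (1 - p_hat) * np.log2(1 - p_hat)).sum()
print(entropy_bits)  # below 256: biased pixels carry less than 1 bit each
```

The biased corner pixels each contribute well under 1 bit, so the total drops below the 256-bit uniform figure, matching the intuition in the paragraph above.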