Solved – Intuition behind sparsity in over-complete sparse auto-encoders

autoencoders, deep learning, machine learning, unsupervised learning

I am trying to get a grasp of the intuition behind the sparse representation used in over-complete auto-encoders. One piece of text that offers a somewhat intuitive explanation is from Yoshua Bengio's Learning Deep Architectures for AI:

Suppose that the representation learned by an auto-encoder is sparse, then the auto-encoder cannot reconstruct well every possible input pattern, because the number of sparse configurations is necessarily smaller than the number of dense configurations.
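
To make the counting claim concrete, here is a quick back-of-the-envelope check (my own illustration, not from Bengio's text; it idealizes the code as binary on/off units, and the values of n and k are arbitrary):

```python
from math import comb

n = 100  # hidden units in an over-complete code
k = 5    # at most k units allowed to be active in a sparse code

# Number of binary activation patterns with at most k active units,
# versus the number of unconstrained (dense) patterns.
sparse_configs = sum(comb(n, i) for i in range(k + 1))
dense_configs = 2 ** n

print(f"sparse: {sparse_configs:.3e}")  # ~7.94e+07
print(f"dense:  {dense_configs:.3e}")   # ~1.27e+30
```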

I have also consulted Stanford's Unsupervised Feature Learning and Deep Learning tutorial, which I did not find especially helpful. I feel that I need more reading material to get a firmer grasp of the topic. Are there any recommendations or resources that might help?

Best Answer

Are you familiar with the notion of over-fitting? The problem here is that an over-complete auto-encoder has too many degrees of freedom, so it can match any training input exactly without any constraint on how it must handle unseen inputs. Without such a constraint, there is nothing pushing the learning system to find a hypothesis that matches the training data and makes predictions about unseen data consistent with the "pattern" followed by the training data.

To make this concrete, imagine a very constrained learner that must find a linear relationship between its single input variable 'x' and its output variable 'y'. With even a small amount of data, this learner can find the best possible linear relationship between x and y. Contrast this with a learner that is allowed to output an arbitrarily long list of (x, y) pairs; given a new input, it returns the 'y' value stored for the nearest 'x'.

If the data is very noisy, the second learner will do a bad job of predicting y even when there is a reasonably good linear relationship between x and y. This is because the nearest x value in the training data is itself noisy, so its stored y value may be far from the true trend.

This learning system has 'overfit' the data: it modeled the training data very precisely, getting a perfect score by placing an (x, y) pair on top of each training point. But its hypothesis space was too flexible. Because the learning algorithm was never forced to select from a constrained set of patterns (like linear relations between x and y), it never learned anything useful about the unseen points; it merely copied the exact shape of the training data into the learned model.
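
As a minimal sketch of this comparison (my own illustration, not part of the original answer): scikit-learn's KNeighborsRegressor with n_neighbors=1 behaves exactly like the "list of (x, y) pairs" learner, and the data-generating numbers below are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Noisy data with an underlying linear pattern: y = 2x + noise
x_train = rng.uniform(0, 10, size=(50, 1))
y_train = 2 * x_train.ravel() + rng.normal(0, 3, size=50)
x_test = rng.uniform(0, 10, size=(200, 1))
y_test = 2 * x_test.ravel() + rng.normal(0, 3, size=200)

# Constrained learner: must pick a line
linear = LinearRegression().fit(x_train, y_train)

# Unconstrained learner: memorizes the training pairs and returns
# the y of the nearest stored x
memorizer = KNeighborsRegressor(n_neighbors=1).fit(x_train, y_train)

print("train R^2, linear:   ", linear.score(x_train, y_train))
print("train R^2, memorizer:", memorizer.score(x_train, y_train))  # perfect: 1.0
print("test R^2, linear:    ", linear.score(x_test, y_test))
print("test R^2, memorizer: ", memorizer.score(x_test, y_test))    # noticeably worse
```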

In an analogous way, a dense over-complete auto-encoder can encode any training input perfectly, providing no constraint to drive generalization. Sparsity restricts the set of codes the auto-encoder may use, so it must capture regularities shared across the data rather than memorizing each input.
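
Here is a minimal sketch of one common way this constraint is imposed in practice, assuming a PyTorch model with an L1 penalty on the hidden activations (the Stanford UFLDL tutorial uses a KL-divergence penalty instead; both play the same role, and the dimensions and weights below are arbitrary):

```python
import torch
import torch.nn as nn

# Over-complete auto-encoder: hidden layer larger than the input,
# with an L1 penalty on hidden activations to encourage sparsity.
input_dim, hidden_dim = 64, 256  # hidden > input: over-complete
sparsity_weight = 1e-3           # strength of the sparsity penalty (arbitrary)

encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
decoder = nn.Linear(hidden_dim, input_dim)
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(32, input_dim)   # a stand-in batch of training data

for step in range(100):
    h = encoder(x)               # over-complete code
    x_hat = decoder(h)           # reconstruction
    # Reconstruction error plus L1 penalty: keeping many hidden units
    # active is costly, so the model must choose which few units to
    # "spend" on each input.
    loss = ((x_hat - x) ** 2).mean() + sparsity_weight * h.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

With the penalty set to zero, the network has enough hidden units to learn an identity-like mapping that reconstructs anything. With the penalty turned on, reconstruction quality must be traded against the number of active units, which is exactly the constraint the argument above says the dense case lacks.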