Solved – How to understand a convolutional deep belief network for audio classification

Tags: classification, deep-belief-networks, intuition, unsupervised learning

In "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations" by Lee et al., convolutional DBNs are proposed and evaluated for image classification. This makes sense, since images have natural local features such as small corners and edges.

In "Unsupervised feature learning for audio classification using convolutional deep belief networks" by Lee et al., the method is applied to audio for several classification tasks: speaker identification, gender identification, phone classification, and some music genre and artist classification.

How can the convolutional part of this network be interpreted for audio, in the way it can be explained for images in terms of edges?

Best Answer

The audio application is a one-dimensional simplification of the two-dimensional image classification problem. A phoneme (for example) is the audio analog of an image feature such as an edge or a circle. In either case such features have an essential locality: they are characterized by values within a relatively small neighborhood of an image location or of a moment of speech. Convolutions are a controlled, regular form of weighted averaging of values within local neighborhoods. This is the source of the hope that a convolutional form of a DBN can succeed at identifying and discriminating among meaningful features.
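To make the "weighted averaging within local neighborhoods" idea concrete, here is a minimal NumPy sketch. It is not the model from Lee et al.: in a convolutional DBN the filters are learned from data, whereas here the filter is hand-made and simply equal to the pattern it should detect. The names `local_filter_response`, `pattern`, and `signal` are hypothetical, chosen only for illustration. The sketch uses `np.correlate`, which slides the filter without flipping it; a true convolution is the same operation with the filter reversed.

```python
import numpy as np

def local_filter_response(signal, kernel):
    """Slide `kernel` along `signal`; return the weighted sum at each offset."""
    # mode="valid" keeps only offsets where the kernel fully overlaps the signal,
    # so response[k] depends only on signal[k : k + len(kernel)].
    return np.correlate(signal, kernel, mode="valid")

# Hypothetical local feature: a short windowed oscillation, 16 samples long.
t = np.arange(16)
pattern = np.sin(2 * np.pi * t / 8) * np.hanning(16)

# A silent signal containing the same feature at two different moments.
signal = np.zeros(200)
signal[40:56] += pattern
signal[130:146] += pattern

response = local_filter_response(signal, pattern)

# Offsets of the two strongest responses: the filter fires where the feature is.
peak_offsets = sorted(int(i) for i in np.argsort(response)[-2:])
print(peak_offsets)
```

The same filter weights are reused at every offset, so the feature is detected wherever it occurs in time. This is the one-dimensional counterpart of an edge filter responding to an edge anywhere in an image.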
