I'm confused. Is there a difference between Deep belief networks and Deep Boltzmann Machines? If so, what's the difference?
Solved – Deep belief networks or Deep Boltzmann Machines
deep learningdeep-belief-networksmachine learningrestricted-boltzmann-machine
Related Solutions
Auto-encoders typically feature many hidden layers. This causes a variety of problems for the common backpropagation-style training methods, because the backpropagated errors become very small in the first few layers.
A solution is to do pretraining, e.g. use initial weights that approximate the final solution. One pretraining technique treats a set of two layers like an RBM to obtain a good set of starting weights which are then fine-tuned using backpropagation. RBMs are useful here because contrastive divergence does not suffer from the same issues as backpropagation.
Autoencoder is a simple 3-layer neural network where output units are directly connected back to input units. E.g. in a network like this:
output[i]
has edge back to input[i]
for every i
. Typically, number of hidden units is much less then number of visible (input/output) ones. As a result, when you pass data through such a network, it first compresses (encodes) input vector to "fit" in a smaller representation, and then tries to reconstruct (decode) it back. The task of training is to minimize an error or reconstruction, i.e. find the most efficient compact representation (encoding) for input data.
RBM shares similar idea, but uses stochastic approach. Instead of deterministic (e.g. logistic or ReLU) it uses stochastic units with particular (usually binary of Gaussian) distribution. Learning procedure consists of several steps of Gibbs sampling (propagate: sample hiddens given visibles; reconstruct: sample visibles given hiddens; repeat) and adjusting the weights to minimize reconstruction error.
Intuition behind RBMs is that there are some visible random variables (e.g. film reviews from different users) and some hidden variables (like film genres or other internal features), and the task of training is to find out how these two sets of variables are actually connected to each other (more on this example may be found here).
Convolutional Neural Networks are somewhat similar to these two, but instead of learning single global weight matrix between two layers, they aim to find a set of locally connected neurons. CNNs are mostly used in image recognition. Their name comes from "convolution" operator or simply "filter". In short, filters are an easy way to perform complex operation by means of simple change of a convolution kernel. Apply Gaussian blur kernel and you'll get it smoothed. Apply Canny kernel and you'll see all edges. Apply Gabor kernel to get gradient features.
(image from here)
The goal of convolutional neural networks is not to use one of predefined kernels, but instead to learn data-specific kernels. The idea is the same as with autoencoders or RBMs - translate many low-level features (e.g. user reviews or image pixels) to the compressed high-level representation (e.g. film genres or edges) - but now weights are learned only from neurons that are spatially close to each other.
All three models have their use cases, pros and cons, but probably the most important properties are:
- Autoencoders are simplest ones. They are intuitively understandable, easy to implement and to reason about (e.g. it's much easier to find good meta-parameters for them than for RBMs).
- RBMs are generative. That is, unlike autoencoders that only discriminate some data vectors in favour of others, RBMs can also generate new data with given joined distribution. They are also considered more feature-rich and flexible.
- CNNs are very specific model that is mostly used for very specific task (though pretty popular task). Most of the top-level algorithms in image recognition are somehow based on CNNs today, but outside that niche they are hardly applicable (e.g. what's the reason to use convolution for film review analysis?).
UPD.
Dimensionality reduction
When we represent some object as a vector of $n$ elements, we say that this is a vector in $n$-dimensional space. Thus, dimensionality reduction refers to a process of refining data in such a way, that each data vector $x$ is translated into another vector $x'$ in an $m$-dimensional space (vector with $m$ elements), where $m < n$. Probably the most common way of doing this is PCA. Roughly speaking, PCA finds "internal axes" of a dataset (called "components") and sorts them by their importance. First $m$ most important components are then used as new basis. Each of these components may be thought of as a high-level feature, describing data vectors better than original axes.
Both - autoencoders and RBMs - do the same thing. Taking a vector in $n$-dimensional space they translate it into an $m$-dimensional one, trying to keep as much important information as possible and, at the same time, remove noise. If training of autoencoder/RBM was successful, each element of resulting vector (i.e. each hidden unit) represents something important about the object - shape of an eyebrow in an image, genre of a film, field of study in scientific article, etc. You take lots of noisy data as an input and produce much less data in a much more efficient representation.
Deep architectures
So, if we already had PCA, why the hell did we come up with autoencoders and RBMs? It turns out that PCA only allows linear transformation of a data vectors. That is, having $m$ principal components $c_1..c_m$, you can represent only vectors $x=\sum_{i=1}^{m}w_ic_i$. This is pretty good already, but not always enough. No matter, how many times you will apply PCA to a data - relationship will always stay linear.
Autoencoders and RBMs, on other hand, are non-linear by the nature, and thus, they can learn more complicated relations between visible and hidden units. Moreover, they can be stacked, which makes them even more powerful. E.g. you train RBM with $n$ visible and $m$ hidden units, then you put another RBM with $m$ visible and $k$ hidden units on top of the first one and train it too, etc. And exactly the same way with autoencoders.
But you don't just add new layers. On each layer you try to learn best possible representation for a data from the previous one:
On the image above there's an example of such a deep network. We start with ordinary pixels, proceed with simple filters, then with face elements and finally end up with entire faces! This is the essence of deep learning.
Now note, that at this example we worked with image data and sequentially took larger and larger areas of spatially close pixels. Doesn't it sound similar? Yes, because it's an example of deep convolutional network. Be it based on autoencoders or RBMs, it uses convolution to stress importance of locality. That's why CNNs are somewhat distinct from autoencoders and RBMs.
Classification
None of models mentioned here work as classification algorithms per se. Instead, they are used for pretraining - learning transformations from low-level and hard-to-consume representation (like pixels) to a high-level one. Once deep (or maybe not that deep) network is pretrained, input vectors are transformed to a better representation and resulting vectors are finally passed to real classifier (such as SVM or logistic regression). In an image above it means that at the very bottom there's one more component that actually does classification.
Best Answer
Although Deep Belief Networks (DBNs) and Deep Boltzmann Machines (DBMs) diagrammatically look very similar, they are actually qualitatively very different. This is because DBNs are directed and DBMs are undirected. If we wanted to fit them into the broader ML picture we could say DBNs are sigmoid belief networks with many densely connected layers of latent variables and DBMs are markov random fields with many densely connected layers of latent variables.
As such they inherit all the properties of these models. For example, in a DBN computing $P(v|h)$, where $v$ is the visible layer and $h$ are the hidden variables is easy. On the other hand computing $P$ of anything is normally computationally infeasible in a DBM because of the intractable partition function.
That being said there are similarities. For example:
References