Can you be more specific about the types of data you are looking at? This will in part determine what type of algorithm will converge the fastest.
I'm also not sure how to compare methods like boosting and DL, since boosting is really just a collection of methods. What other algorithms are you using with boosting?
In general, DL techniques can be described as layers of encoders/decoders. Unsupervised pre-training works by training one layer at a time: encode the signal, decode it, and measure the reconstruction error. Fine-tuning can then be used to get better performance (e.g. with stacked denoising autoencoders you can fine-tune the whole stack with back-propagation).
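As a rough illustration of that layer-wise scheme, here is a minimal NumPy sketch of pre-training a single autoencoder layer on toy data. The layer sizes, learning rate, tied weights and squared reconstruction error are my own assumptions for the example, not something prescribed by the references below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.random((500, 64))            # toy input data: 500 samples, 64 features

n_in, n_hidden = 64, 32              # assumed layer sizes
W = rng.normal(0, 0.1, (n_in, n_hidden))
b_enc = np.zeros(n_hidden)
b_dec = np.zeros(n_in)
lr = 0.1

for epoch in range(50):
    # encode the signal, then decode it (tied weights: W and W.T)
    H = sigmoid(X @ W + b_enc)              # encoder
    X_rec = sigmoid(H @ W.T + b_dec)        # decoder
    err = X_rec - X
    loss = np.mean(err ** 2)                # reconstruction error

    # gradients of the squared reconstruction error
    d_rec = err * X_rec * (1 - X_rec)
    d_hid = (d_rec @ W) * H * (1 - H)
    grad_W = X.T @ d_hid + d_rec.T @ H      # tied-weight gradient (both paths)
    W -= lr * grad_W / len(X)
    b_enc -= lr * d_hid.mean(axis=0)
    b_dec -= lr * d_rec.mean(axis=0)

# The hidden code H learned here would feed the next layer,
# and the whole stack is later fine-tuned with back-propagation.
```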
One good starting point for DL theory is:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.795&rep=rep1&type=pdf
as well as these:
http://portal.acm.org/citation.cfm?id=1756025
(sorry, had to delete last link due to SPAM filtration system)
I didn't include any information on RBMs, but they are closely related (though I personally found them a little more difficult to understand at first).
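That said, the core RBM update (contrastive divergence with a single Gibbs step, CD-1) is only a few lines. Here is a rough sketch; the sizes, learning rate, and toy binary data are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V = (rng.random((500, 64)) > 0.5).astype(float)    # toy binary data

n_vis, n_hid = 64, 32                              # assumed sizes
W = rng.normal(0, 0.1, (n_vis, n_hid))
a = np.zeros(n_vis)                                # visible biases
b = np.zeros(n_hid)                                # hidden biases
lr = 0.1

for epoch in range(20):
    # positive phase: hidden probabilities given the data
    ph = sigmoid(V @ W + b)
    h = (rng.random(ph.shape) < ph).astype(float)  # sample hidden states

    # negative phase: one Gibbs step (reconstruct visibles, re-infer hiddens)
    pv = sigmoid(h @ W.T + a)
    ph2 = sigmoid(pv @ W + b)

    # CD-1 update: data correlations minus reconstruction correlations
    W += lr * (V.T @ ph - pv.T @ ph2) / len(V)
    a += lr * (V - pv).mean(axis=0)
    b += lr * (ph - ph2).mean(axis=0)
```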
Another good introduction on the subject is the CSC321 course at the University of Toronto, and the neuralnets-2012-001 course on Coursera, both taught by Geoffrey Hinton.
From the video on Belief Nets:
Graphical models
Early graphical models used experts to define the graph structure and the conditional probabilities. The graphs were sparsely connected, and the focus was on performing correct inference, and not on learning (the knowledge came from the experts).
Neural networks
For neural nets, learning was central. Hard-wiring the knowledge was not cool (OK, maybe a little bit). Knowledge came from learning the training data, not from experts. Neural networks did not aim for interpretability or sparse connectivity to make inference easy. Nevertheless, there are neural-network versions of belief nets.
My understanding is that belief nets are usually too densely connected, and their cliques are too large, to be interpretable. Belief nets use the sigmoid function to integrate inputs, while continuous graphical models typically use Gaussian units. The sigmoid makes the network easier to train, but harder to interpret in terms of probabilities. I believe both are in the exponential family.
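To make the sigmoid point concrete: in a sigmoid belief net, a binary unit turns on with probability sigma(b + sum_j s_j * w_j), where the s_j are the states of its parents. A tiny sketch with made-up weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
parents = np.array([1.0, 0.0, 1.0])   # binary states of the parent units (assumed)
w = np.array([0.8, -1.2, 0.5])        # weights from parents to this unit (made up)
b = -0.3                              # bias of the unit

p_on = sigmoid(b + parents @ w)       # probability the unit turns on
s = float(rng.random() < p_on)        # sampled binary state of the unit
print(p_on, s)
```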
I am far from an expert on this, but the lecture notes and videos are a great resource.
Best Answer
Fully answering this question would require many pages here. Don't forget, Stack Exchange is not a textbook that someone reads to you.
Material for you:
These explanations are far from complete, but hopefully correct. If you want to understand this field, you have to read much more than this.