SeanEaster has some good advice. Bayes factors can be difficult to compute, but there are some good blog posts specifically about computing Bayes factors in PyMC2.
A closely related question is the goodness of fit of a model. A fair method for assessing this is simple inspection: posteriors can give us evidence of goodness of fit, as in the passage you quote:
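To make concrete why Bayes factors are awkward (this is my own toy illustration, not taken from those posts): each model's marginal likelihood has to be integrated over its prior, which you can approximate, very crudely, by Monte Carlo. The single-rate vs. switchpoint comparison, the priors, and the sample sizes below are all made up for illustration:
import numpy as np
from scipy import stats
from scipy.special import logsumexp
# Fake Poisson count data with a switch at day 45, purely for illustration
counts = np.random.poisson(np.r_[np.full(45, 18.0), np.full(35, 23.0)])
def log_marginal_one_rate(counts, n_draws=20000):
    # p(data | M1) = E_prior[p(data | lambda)], averaged over prior draws
    lam = np.random.exponential(scale=20.0, size=n_draws)
    loglik = stats.poisson.logpmf(counts[None, :], lam[:, None]).sum(axis=1)
    return logsumexp(loglik) - np.log(n_draws)
def log_marginal_switch(counts, tau=45, n_draws=20000):
    # Same idea for the two-rate model with a known switchpoint tau
    lam1 = np.random.exponential(scale=20.0, size=n_draws)
    lam2 = np.random.exponential(scale=20.0, size=n_draws)
    loglik = (stats.poisson.logpmf(counts[None, :tau], lam1[:, None]).sum(axis=1) +
              stats.poisson.logpmf(counts[None, tau:], lam2[:, None]).sum(axis=1))
    return logsumexp(loglik) - np.log(n_draws)
log_bf = log_marginal_switch(counts) - log_marginal_one_rate(counts)
print('log Bayes factor (switch vs. single rate):', log_bf)
Even this toy estimator is noisy, which is part of why dedicated write-ups on computing Bayes factors exist.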
"Had no change occurred, or had the change been gradual over time, the posterior distribution of $\tau$ would have been more spread out"
This is true. The posterior is quite peaked near time 45. As you say, more than 50% of the mass is at 45, whereas if there were no switchpoint the mass would (theoretically) be closer to 1/80 = 1.25% at time 45.
What you are aiming to do is faithfully reconstruct the observed data set, given your model. In Chapter 2, there are simulations of generating fake data. If your observed data look wildly different from your artificial data, then your model is likely not a good fit.
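As a rough sketch of that kind of check, in the spirit of the text-message switchpoint model (the stand-in posterior samples below are placeholders; in practice they come from your MCMC trace, e.g. trace['tau'], trace['lambda_1'], trace['lambda_2']):
import numpy as np
# Stand-in posterior samples purely so the sketch runs
tau_samples = np.random.randint(43, 48, size=5000)
lambda_1_samples = np.random.normal(18.0, 0.7, size=5000)
lambda_2_samples = np.random.normal(23.0, 0.8, size=5000)
n_days, n_sims = 80, 500
days = np.arange(n_days)
simulated = np.empty((n_sims, n_days))
for i in range(n_sims):
    idx = np.random.randint(len(tau_samples))   # one joint posterior draw
    rate = np.where(days < tau_samples[idx],
                    lambda_1_samples[idx], lambda_2_samples[idx])
    simulated[i] = np.random.poisson(rate)
# If the observed series looks wildly unlike these simulated series
# (totals, variability, location of the jump), the model fits poorly.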
I apologize for the non-rigorous answer, but really this is a major difficulty that I haven't efficiently overcome.
I did two things to fix your code: one was to initialize the model away from zero, and the other was to use a non-gradient-based optimizer:
import pymc3 as pm
import numpy as np
import pandas as pd
import theano
import scipy as sp
import scipy.optimize  # makes sp.optimize.fmin_powell available below
data = pd.read_csv('jester-dense-subset-100x20.csv')
n, m = data.shape
test_size = m // 10                          # integer division: number of held-out columns
train_size = m - test_size
train = data.copy()
train.iloc[:, train_size:] = np.nan          # remove test set data
train[train.isnull()] = train.mean().mean()  # mean value imputation
train = train.values
test = data.copy()
test.iloc[:, :train_size] = np.nan           # remove train set data
test = test.values
# Low precision reflects uncertainty; prevents overfitting
alpha_u = alpha_v = 1/np.var(train)
alpha = np.ones((n,m)) * 2 # fixed precision for likelihood function
dim = 10 # dimensionality
# Specify the model.
with pm.Model() as pmf:
    pmf_U = pm.MvNormal('U', mu=0, tau=alpha_u * np.eye(dim),
                        shape=(n, dim), testval=np.random.randn(n, dim) * .01)
    pmf_V = pm.MvNormal('V', mu=0, tau=alpha_v * np.eye(dim),
                        shape=(m, dim), testval=np.random.randn(m, dim) * .01)
    pmf_R = pm.Normal('R', mu=theano.tensor.dot(pmf_U, pmf_V.T),
                      tau=alpha, observed=train)

    # Find starting values at the posterior mode using a gradient-free optimizer
    start = pm.find_MAP(fmin=sp.optimize.fmin_powell)
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step, start=start)
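One thing you might add on top of this (my suggestion, not part of the fix above): score the held-out ratings by averaging U Vᵀ over the posterior samples and computing RMSE on the test cells, e.g.:
# Posterior-mean reconstruction of the rating matrix
U_post = trace['U'].mean(axis=0)   # shape (n, dim)
V_post = trace['V'].mean(axis=0)   # shape (m, dim)
R_pred = U_post.dot(V_post.T)
# The test array is NaN everywhere except the held-out columns
mask = ~np.isnan(test)
rmse = np.sqrt(np.mean((R_pred[mask] - test[mask]) ** 2))
print('held-out RMSE:', rmse)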
This is an interesting model that would make a great contribution. Once you're certain it works as desired, please consider adding it to the examples folder and opening a pull request.
Best Answer
It's generally true in my personal experience as a professional data scientist; it's what I observe most of the time. If you're asking why it happens this way, there are a few reasons:
1. Many traditional ML algorithms are nowadays available "off the shelf", including sophisticated ensemble methods, neural networks, etc. Probabilistic methods still often require bespoke solutions, either written in a DSL like Stan or directly in a general-purpose programming language.
2. Many people entering data science nowadays come from engineering and natural-science backgrounds, where they build strong mathematical and "algorithmic" skills but get little experience or intuition with probability modeling. It's just not on their radar, and they aren't as comfortable with the methods and the software required to implement them.
3. Making a "hard" prediction from a probabilistic model involves either hand-waving or formal decision theory (a toy sketch follows this list). AI researchers and highly paid statistical consultants know this, and they embrace it. But for the rank-and-file data scientist, it's not so easy to turn to your manager and start speaking in terms of distributions and probabilities. The business (or the automated system you're building) just needs a damn answer. Life is a whole lot easier once you stop equivocating about probabilities, at which point you might as well not have bothered with them in the first place.
4. Probabilistic modeling often ends up being very computationally intensive, especially Bayesian modeling where closed-form solutions are a rare luxury, and doubly so on "big" data sets. I wouldn't hesitate to run XGBoost on a data set with 10 million rows. I wouldn't even consider running a Stan model on a data set with 10 million rows.
5. Given all the downsides described above, a data scientist or a small team of data scientists can iterate much more quickly using less-probabilistic machine-learning techniques and still get "good enough" results.
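To illustrate point 3 with a made-up example (mine, not from any particular project): with an explicit loss table, turning a predictive probability into a hard decision is just "pick the action with the smallest expected loss", but someone has to own that loss table.
# Hypothetical churn example: the model only gives P(churn); the decision needs losses
p_churn = 0.30
loss = {
    ('intervene', 'churn'): 10.0,   # retention offer sent, customer would have churned
    ('intervene', 'stay'):  10.0,   # offer wasted on a customer who was staying anyway
    ('ignore',    'churn'): 100.0,  # customer lost
    ('ignore',    'stay'):   0.0,
}
expected_loss = {
    a: p_churn * loss[(a, 'churn')] + (1 - p_churn) * loss[(a, 'stay')]
    for a in ('intervene', 'ignore')
}
print(min(expected_loss, key=expected_loss.get), expected_loss)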
Edit: as pointed out in the comments, #1 and #2 could both be because probabilistic programming methods haven't yet been shown to have knockout performance on real-world problems. CNNs got popular because they blew away existing techniques.
Edit 2: it seems that probabilistic programming is getting popular for time-series modeling, where deep learning doesn't seem as effective as in other domains.
Edit 3 (December 2020): probabilistic programming is now becoming much more popular in a wide variety of domains, as probabilistic programming languages get better and the solvers get better and faster.