SeanEaster has some good advice. Bayes factors can be difficult to compute, but there are some good blog posts specifically about computing Bayes factors in PyMC2.
A closely related question is the goodness of fit of a model. A fair method for assessing this is simple inspection: posteriors can give us evidence of goodness of fit, as in the passage you quote:
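To make concrete why Bayes factors are awkward (this is my own toy illustration, not taken from those posts): each model's marginal likelihood has to be integrated over its prior, which you can approximate, very crudely, by Monte Carlo. The single-rate vs. switchpoint comparison, the priors, and the sample sizes below are all made up for illustration:
import numpy as np
from scipy import stats
from scipy.special import logsumexp
# Fake Poisson count data with a switch at day 45, purely for illustration
counts = np.random.poisson(np.r_[np.full(45, 18.0), np.full(35, 23.0)])
def log_marginal_one_rate(counts, n_draws=20000):
    # p(data | M1) = E_prior[p(data | lambda)], averaged over prior draws
    lam = np.random.exponential(scale=20.0, size=n_draws)
    loglik = stats.poisson.logpmf(counts[None, :], lam[:, None]).sum(axis=1)
    return logsumexp(loglik) - np.log(n_draws)
def log_marginal_switch(counts, tau=45, n_draws=20000):
    # Same idea for the two-rate model with a known switchpoint tau
    lam1 = np.random.exponential(scale=20.0, size=n_draws)
    lam2 = np.random.exponential(scale=20.0, size=n_draws)
    loglik = (stats.poisson.logpmf(counts[None, :tau], lam1[:, None]).sum(axis=1) +
              stats.poisson.logpmf(counts[None, tau:], lam2[:, None]).sum(axis=1))
    return logsumexp(loglik) - np.log(n_draws)
log_bf = log_marginal_switch(counts) - log_marginal_one_rate(counts)
print('log Bayes factor (switch vs. single rate):', log_bf)
Even this toy estimator is noisy, which is part of why dedicated write-ups on computing Bayes factors exist.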
"Had no change occurred, or had the change been gradual over time, the posterior distribution of $\tau$ would have been more spread out"
This is true. The posterior is quite peaked near time 45. As you say, more than 50% of the mass is at 45, whereas if there were no switchpoint the mass would (theoretically) be closer to 1/80 = 1.25% at time 45.
What you are aiming to do is faithfully reconstruct the observed data set, given your model. In Chapter 2, there are simulations of generating fake data. If your observed data look wildly different from your artificial data, then your model is likely not a good fit.
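As a rough sketch of that kind of check, in the spirit of the text-message switchpoint model (the stand-in posterior samples below are placeholders; in practice they come from your MCMC trace, e.g. trace['tau'], trace['lambda_1'], trace['lambda_2']):
import numpy as np
# Stand-in posterior samples purely so the sketch runs
tau_samples = np.random.randint(43, 48, size=5000)
lambda_1_samples = np.random.normal(18.0, 0.7, size=5000)
lambda_2_samples = np.random.normal(23.0, 0.8, size=5000)
n_days, n_sims = 80, 500
days = np.arange(n_days)
simulated = np.empty((n_sims, n_days))
for i in range(n_sims):
    idx = np.random.randint(len(tau_samples))   # one joint posterior draw
    rate = np.where(days < tau_samples[idx],
                    lambda_1_samples[idx], lambda_2_samples[idx])
    simulated[i] = np.random.poisson(rate)
# If the observed series looks wildly unlike these simulated series
# (totals, variability, location of the jump), the model fits poorly.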
I apologize for the non-rigorous answer, but really this is a major difficulty that I haven't efficiently overcome.
I did two things to fix your code: one was to initialize the model away from zero, and the other was to use a non-gradient-based optimizer:
import pymc3 as pm
import numpy as np
import pandas as pd
import theano
import scipy as sp
import scipy.optimize  # makes sp.optimize.fmin_powell available below
data = pd.read_csv('jester-dense-subset-100x20.csv')
n, m = data.shape
test_size = m // 10                          # integer division: number of held-out columns
train_size = m - test_size
train = data.copy()
train.iloc[:, train_size:] = np.nan          # remove test set data
train[train.isnull()] = train.mean().mean()  # mean value imputation
train = train.values
test = data.copy()
test.iloc[:, :train_size] = np.nan           # remove train set data
test = test.values
# Low precision reflects uncertainty; prevents overfitting
alpha_u = alpha_v = 1/np.var(train)
alpha = np.ones((n,m)) * 2 # fixed precision for likelihood function
dim = 10 # dimensionality
# Specify the model.
with pm.Model() as pmf:
    pmf_U = pm.MvNormal('U', mu=0, tau=alpha_u * np.eye(dim),
                        shape=(n, dim), testval=np.random.randn(n, dim) * .01)
    pmf_V = pm.MvNormal('V', mu=0, tau=alpha_v * np.eye(dim),
                        shape=(m, dim), testval=np.random.randn(m, dim) * .01)
    pmf_R = pm.Normal('R', mu=theano.tensor.dot(pmf_U, pmf_V.T),
                      tau=alpha, observed=train)

    # Find starting values at the posterior mode using a gradient-free optimizer
    start = pm.find_MAP(fmin=sp.optimize.fmin_powell)
    step = pm.NUTS(scaling=start)
    trace = pm.sample(500, step, start=start)
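One thing you might add on top of this (my suggestion, not part of the fix above): score the held-out ratings by averaging U Vᵀ over the posterior samples and computing RMSE on the test cells, e.g.:
# Posterior-mean reconstruction of the rating matrix
U_post = trace['U'].mean(axis=0)   # shape (n, dim)
V_post = trace['V'].mean(axis=0)   # shape (m, dim)
R_pred = U_post.dot(V_post.T)
# The test array is NaN everywhere except the held-out columns
mask = ~np.isnan(test)
rmse = np.sqrt(np.mean((R_pred[mask] - test[mask]) ** 2))
print('held-out RMSE:', rmse)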
This is an interesting model that would make a great contribution. Once you're certain it works as desired, please consider adding it to the examples folder and opening a pull request.
Best Answer
It's generally true in my personal experience as a professional data scientist; it's what I observe most of the time. If you're asking why it happens this way, there are a few reasons:
1. Many traditional ML algorithms are nowadays available "off the shelf", including sophisticated ensemble methods, neural networks, etc. Probabilistic methods still often require bespoke solutions, either written in a DSL like Stan or directly in a general-purpose programming language.
2. Many people entering data science nowadays come from engineering and natural-science backgrounds, where they build strong mathematical and "algorithmic" skills but get little experience or intuition with probability modeling. It's just not on their radar, and they aren't as comfortable with the methods and the software required to implement them.
3. Making a "hard" prediction from a probabilistic model involves either hand-waving or formal decision theory (a toy sketch follows this list). AI researchers and highly paid statistical consultants know this, and they embrace it. But for the rank-and-file data scientist, it's not so easy to turn to your manager and start speaking in terms of distributions and probabilities. The business (or the automated system you're building) just needs a damn answer. Life is a whole lot easier once you stop equivocating about probabilities, at which point you might as well not have bothered with them in the first place.
4. Probabilistic modeling often ends up being very computationally intensive, especially Bayesian modeling where closed-form solutions are a rare luxury, and doubly so on "big" data sets. I wouldn't hesitate to run XGBoost on a data set with 10 million rows. I wouldn't even consider running a Stan model on a data set with 10 million rows.
5. Given all the downsides described above, a data scientist or a small team of data scientists can iterate much more quickly using less-probabilistic machine-learning techniques and still get "good enough" results.
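To illustrate point 3 with a made-up example (mine, not from any particular project): with an explicit loss table, turning a predictive probability into a hard decision is just "pick the action with the smallest expected loss", but someone has to own that loss table.
# Hypothetical churn example: the model only gives P(churn); the decision needs losses
p_churn = 0.30
loss = {
    ('intervene', 'churn'): 10.0,   # retention offer sent, customer would have churned
    ('intervene', 'stay'):  10.0,   # offer wasted on a customer who was staying anyway
    ('ignore',    'churn'): 100.0,  # customer lost
    ('ignore',    'stay'):   0.0,
}
expected_loss = {
    a: p_churn * loss[(a, 'churn')] + (1 - p_churn) * loss[(a, 'stay')]
    for a in ('intervene', 'ignore')
}
print(min(expected_loss, key=expected_loss.get), expected_loss)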
Edit: as pointed out in the comments, #1 and #2 could both be because probabilistic programming methods haven't yet been shown to have knockout performance on real-world problems. CNNs got popular because they blew away existing techniques.
Edit 2: it seems that probabilistic programming is getting popular for time-series modeling, where deep learning doesn't seem as effective as in other domains.
Edit 3 (December 2020): probabilistic programming is now becoming much more popular in a wide variety of domains, as probabilistic programming languages get better and the solvers get better and faster.