Here is one approach to the automation. Feedback is much appreciated. The idea is to replace the initial visual inspection with computation, followed by a subsequent visual inspection, in keeping with standard practice.
This solution combines two steps: first, calculate the burn-in to discard the portion of the chain before some convergence threshold is reached, and then use the autocorrelation matrix to calculate the thinning interval.
- calculate a vector of the maximum median Gelman-Rubin convergence diagnostic shrink factor (grsf) across all variables in the mcmc object
- find the minimum number of samples at which the grsf across all variables drops below some threshold, e.g. 1.1 in the example below, perhaps lower in practice
- subsample the chains from this point to the end of the chain
- thin the chain using the autocorrelation of the most autocorrelated variables
- visually confirm convergence with trace, autocorrelation, and density plots
The mcmc object can be downloaded here: jags.out.Rdata
# jags.out is the mcmc.object with m variables
library(coda)
load('jags.out.Rdata')
# 1. calculate max.gd.vec,
# max.gd.vec is a vector of the maximum shrink factor
max.gd.vec <- apply(gelman.plot(jags.out)$shrink[, ,'median'], 1, max)
# 2. will use window() to subsample the jags.out mcmc.object
# 3. start the window at the first iteration where max.gd.vec < 1.1 (but no earlier than 100)
window.start <- max(100, min(as.numeric(names(which(max.gd.vec < 1.1)))))
jags.out.trunc <- window(jags.out, start = window.start)
# 4. calculate thinning interval
# thin.int is the chain thin interval
# step is very slow
# 4.1 find the n most autocorrelated variables
acm <- autocorr.diag(jags.out.trunc)
n <- min(3, ncol(acm))
acm.subset <- colnames(acm)[order(colSums(acm), decreasing = TRUE)][1:n]
jags.out.subset <- jags.out.trunc[, acm.subset]
# 4.2 calculate the thinning interval
# ac.int is the lag step for the autocorrelation matrix
ac.int <- 500 # set high to reduce computation time
acm2 <- autocorr.diag(jags.out.subset, lags = ac.int * (1:20))
# thin at the first lag where autocorrelation drops below zero (at least 50)
thin.int <- max(apply(acm2 < 0, 2, function(x) match(TRUE, x)) * ac.int, 50, na.rm = TRUE)
# 4.3 thin the chain
jags.out.thin <- window(jags.out.trunc, thin = thin.int)
# 5. plots for visual diagnostics
plot(jags.out.thin)
autocorr.plot(jags.out.thin)
--update--
As implemented in R, computing the autocorrelation matrix is slower than would be desirable (>15 min in some cases); to a lesser extent, so is computing the GR shrink factor. There is a question about how to speed up step 4 on Stack Overflow here
--update part 2--
Additional answers:
It is not possible to diagnose convergence, only to diagnose a lack of convergence (Brooks, Giudici, and Philippe, 2003).
The function autorun.jags from the runjags package automates calculation of run length and convergence diagnostics. It does not start monitoring the chain until the Gelman-Rubin diagnostic is below 1.05, and it calculates the required chain length using the Raftery and Lewis diagnostic.
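For reference, a minimal sketch of an autorun.jags call (the model and data here are made up for illustration, and a JAGS installation is required):

```r
library(runjags)

# toy model: normal mean and precision (hypothetical example)
model.string <- "
model {
  for (i in 1:N) { y[i] ~ dnorm(mu, tau) }
  mu  ~ dnorm(0, 0.001)
  tau ~ dgamma(0.01, 0.01)
}"

set.seed(10)
results <- autorun.jags(model = model.string,
                        monitor = c('mu', 'tau'),
                        data = list(y = rnorm(50, 2, 1), N = 50),
                        n.chains = 3,
                        psrf.target = 1.05)  # extend sampling until GR diagnostic < 1.05
plot(results)
```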
Gelman et al. (Gelman, 2004, Bayesian Data Analysis, p. 295; Gelman and Shirley, 2010) state that they use the conservative approach of discarding the first half of the chain. Although a relatively simple solution, in practice this is sufficient to solve the issue for my particular set of models and data.
# code for answer 3
chain.length <- summary(jags.out)$end
jags.out.trunc <- window(jags.out, start = chain.length / 2)
# thin based on autocorrelation if the first negative lag is < 50, otherwise stop
lags <- c(1, 5, 10, 15, 25)
acm <- autocorr.diag(jags.out.trunc, lags = lags)
# first lag at which each variable's autocorrelation drops below zero
first.neg.lag <- lags[apply(acm < 0, 2, function(x) match(TRUE, x))]
thin.int <- min(c(first.neg.lag, 50), na.rm = TRUE)
# if no lag qualifies, require visual inspection; check the acceptance rate
if (thin.int == 50) stop('check acceptance rate, inspect diagnostic figures')
jags.out.thin <- window(jags.out.trunc, thin = thin.int)
Check out the DHARMa package in R. It uses a simulation-based approach with quantile residuals to generate the type of residuals you may be interested in, and it works with glm.nb from MASS.
The essential idea is explained here and goes in three steps:
- Simulate plausible responses for each case. You can use the distribution of each regression coefficient (the coefficient together with its standard error) to generate several sets of coefficients, then multiply each set of coefficients by the observed predictors to obtain multiple simulated response values for each case.
- From the multiple response values for each case, generate the empirical cumulative distribution function (cdf).
- Find the value of the empirical cdf that corresponds to the observed response for each case. This is the residual, and it ranges from 0 to 1.
If there is no systematic misspecification in your model, all values of the empirical cdf are equally likely, and a histogram of these residuals should be uniform between 0 and 1.
The package contains additional checks.
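The general idea can be sketched in base R with a toy Poisson model (this uses simulate(), which conditions on the fitted coefficients rather than redrawing them, and is not DHARMa's actual implementation):

```r
set.seed(1)
# toy data: Poisson counts depending on one predictor
x <- runif(200)
y <- rpois(200, lambda = exp(0.5 + 1.5 * x))
fit <- glm(y ~ x, family = poisson)

# 1. simulate plausible responses for each case (200 cases x 250 simulations)
sims <- simulate(fit, nsim = 250)

# 2-3. empirical cdf of each case's simulations, evaluated at the observed response
res <- sapply(seq_along(y), function(i) ecdf(unlist(sims[i, ]))(y[i]))

# under a correctly specified model these should look roughly uniform on [0, 1]
hist(res, breaks = 20)
```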
Edit:
The steps above are not exactly correct. The biggest difference between my description and what DHARMa does is that DHARMa uses the simulate() function in R, which ignores the variability in the estimated regression coefficients. The Gelman and Hill regression text recommends taking the variability of the regression coefficients into account.
A crucial step I forgot to include: once the responses have been generated, they should be placed on the response scale. For example, the predicted values from a logistic regression are logits, so they should be placed on the probability scale. The next step is to randomly generate observed scores using the predicted probabilities; continuing with the logistic regression example, one can use rbinom() to generate 0-1 outcomes given the predicted probabilities.
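For the logistic case, those two steps might look like this (the fitted logits here are made up for illustration):

```r
set.seed(2)
# hypothetical fitted logits for five cases from a logistic regression
eta <- c(-2, -0.5, 0, 0.5, 2)
p <- plogis(eta)                                # place predictions on the probability scale
y.sim <- rbinom(length(p), size = 1, prob = p)  # simulate 0-1 outcomes from those probabilities
```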
Additionally, when the responses are integers, as with binary outcomes or count data models, DHARMa adds random noise to the simulated responses prior to computing the empirical cdf. This step ensures the residuals behave as one would expect in OLS if the model were not misspecified. Without it, you could have a pile-up of residuals at 1 if your outcome is binary.
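The jittering idea can be sketched for a single binary case (the values here are made up; without the added noise, the empirical cdf evaluated at an observed 1 would always equal 1):

```r
set.seed(3)
# 250 simulated binary responses for one case, plus the observed response
sims <- rbinom(250, size = 1, prob = 0.3)
obs  <- 1

# jitter both simulations and observation before taking the empirical cdf,
# so ties among the integer values are broken at random
res <- ecdf(sims + runif(250))(obs + runif(1))
```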
The code for the simulate residuals function in DHARMa is relatively easy to follow for anyone looking to roll their own implementation.
Best Answer
I think the use of the term residual is not consistent with Bayesian regression. Remember, in frequentist probability models the parameters are considered fixed, estimable quantities, and the data-generating mechanism has some random probability model associated with the observed data. For Bayesians, the parameters of probability models are considered variable, and the fixed data update our belief about what those parameters are.

Therefore, if you were calculating the variance of the observed minus fitted values in a regression model, the observed component would have zero variance, whereas the fitted component would vary as a function of the posterior probability density of the model parameters. This is the opposite of what you would derive from the frequentist regression model.

I think if one were interested in checking the probabilistic assumptions of their Bayesian regression model, a simple Q-Q plot of the posterior density of the parameter estimates (estimated from our MCMC sampling) against a normal distribution would have diagnostic power analogous to analyzing residuals (or Pearson residuals for non-linear link functions).
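For example, with a vector of MCMC draws for one coefficient (faked here with rnorm purely for illustration), such a check could be:

```r
set.seed(4)
# stand-in for 4000 posterior draws of a regression coefficient
beta.draws <- rnorm(4000, mean = 1.2, sd = 0.3)

# Q-Q plot of the posterior draws against a normal distribution
qqnorm(beta.draws)
qqline(beta.draws)
```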