Solved – Stan $\hat{R}$ versus Gelman-Rubin $\hat{R}$ definition

convergencegibbsmarkov-chain-montecarlometropolis-hastingsstan

I was going through the Stan documentation which can be downloaded from here. I was particularly interested in their implementation of the Gelman-Rubin diagnostic. The original paper Gelman & Rubin (1992) define the the potential scale reduction factor (PSRF) as follows:

Let $X_{i,1}, \dots , X_{i,N}$ be the $i$th Markov chain sampled, and let there be overall $M$ independent chains sampled. Let $\bar{X}_{i\cdot}$ be the mean from the $i$th chain, and $\bar{X}_{\cdot \cdot}$ be the overall mean. Define,
$$W = \dfrac{1}{M} \sum_{m=1}^{M} {s^2_m}, $$
where
$$s^2_m = \dfrac{1}{N-1} \sum_{t=1}^{N} (X_{m t} – \bar{X}_{m \cdot})^2\,. $$
And define $B$
$$B = \dfrac{N}{M-1} \sum_{m=1}^{M} (\bar{X}_{m \cdot} – \bar{X}_{\cdot \cdot})^2 \,.$$

Define $$\hat{V} = \left(\dfrac{N-1}{N} \right)W + \left( \dfrac{M+1}{MN} \right)B\,.$$
The PSRF is estimate with $\sqrt{\hat{R}}$ where
$$ \hat{R} = \dfrac{\hat{V}}{W} \cdot \dfrac{df+3}{df+1}\,,$$
where $df = 2\hat{V}/Var(\hat{V})$.

The Stan documentation on page 349 ignores the term with $df$ and also removes the $(M+1)/M$ multiplicative term. This is their formula,

The variance estimator is
$$\widehat{\text{var}}^{+}(\theta \, | \, y) = \frac{N-1}{N} W + \frac{1}{N} B\,.$$
Finally, the potential scale reduction statistic is defined by
$$ \hat{R} = \sqrt{\frac{\widehat{\text{var}}^{+}(\theta \, | \, y) }{W}}\,. $$

From what I could see, they don't provide a reference for this change of formula, and neither do they discuss it. Usually $M$ is not too large, and can often be as low so as $2$, so $(M+1)/M$ should not be ignored, even if the $df$ term can be approximated with 1.

So where does this formula come from?


EDIT:
I have found a partial answer to the question "where does this formula come from?", in that the Bayesian Data Analysis book by Gelman, Carlin, Stern, and Rubin (Second edition) has exactly the same formula. However, the book does not explain how/why it is justifiable to ignore those terms?

Best Answer

I followed the specific link given for Gelman & Rubin (1992) and it has $$ \hat{\sigma} = \frac{n-1}{n}W+ \frac{1}{n}B $$ as in the later versions, although $\hat{\sigma}$ replaced with $\hat{\sigma}_+$ in Brooks & Gelman (1998) and with $\widehat{\rm var}^+$ in BDA2 (Gelman et al, 2003) and BDA3 (Gelman et al, 2013).

BDA2 and BDA3 (couldn't check now BDA1) have an exercise with hints to show that $\widehat{\rm var}^+$ is unbiased estimate of the desired quantity.

Gelman & Brooks (1998) has equation 1.1 $$ \hat{R} = \frac{m+1}{m}\frac{\hat{\sigma}_+}{W} - \frac{n-1}{mn}, $$ which can be rearranged as $$ \hat{R} = \frac{\hat{\sigma}_+}{W} + \frac{\hat{\sigma}_+}{Wm}- \frac{n-1}{mn}. $$ We can see that the effect of second and third term are negligible for decision making when $n$ is large. See also the discussion in the paragraph before Section 3.1 in Brooks & Gelman (1998).

Gelman & Rubin (1992) also had the term with df as df/(df-2). Brooks & Gelman (1998) have a section describing why this df corretion is incorrect and define (df+3)/(df+1). The paragraph before Section 3.1 in Brooks & Gelman (1998) explains why (d+3)/(d+1) can be dropped.

It seems your source for the equations was something post Brooks & Gelman (1998) as you had (d+3)/(d+1) there and Gelman & Rubin (1992) had df/df(-2). Otherwise Gelman & Rubin (1992) and Brooks & Gelman (1998) have equivalent equations (with slightly different notations and some terms are arranged differently). BDA2 (Gelman, et al., 2003) doesn't have anymore terms $\frac{\hat{\sigma}_+}{Wm}- \frac{n-1}{mn}$. BDA3 (Gelman et al., 2003) and Stan introduced split chains version.

My interpretation of the papers and experiences using different versions of $\hat{R}$ is that the terms which have been eventually dropped can be ignored when $n$ is large, even when $m$ is not. I also vaguely remember discussing this with Andrew Gelman years ago, but if you want to be certain of the history, you should ask him.

Usually M is not too large, and can often be as low so as 2

I really do hope that this is not often the case. In cases where you want to use split-$\hat{R}$ convergence diagnostic, you should use at least 4 chains split and thus have M=8. You may use less chains, if you already know that in your specific cases the convergence and mixing is fast.

Additional reference:

  • Brooks and Gelman (1998). Journal of Computational and Graphical Statistics, 7(4)434-455.
Related Question