Sample Size – Choosing Batch Size for Multivariate Effective Sample Size (multiESS)


(For those interested, my MATLAB implementation of multiESS is available here.)


I am re-implementing the multivariate Effective Sample Size (multiESS) estimator for MCMC from Vats et al. (2015). User @Greenparker gave an excellent description of the method in this answer; I copy here the relevant part since I couldn't do better (check her full answer for more details):

The multivariate ESS returns one number for the effective sample size for the quantities you want to estimate; and it does so by accounting for all the cross-correlations in the process. […] Suppose you are interested in the $p$-vector of means of the posterior distribution. The mESS is defined as follows
$$\text{mESS} = n \left(\dfrac{|\Lambda|}{|\Sigma|}\right)^{1/p}. $$
Here

  1. $\Lambda$ is the covariance structure of the posterior (also the asymptotic covariance in the CLT if you had independent samples)
  2. $\Sigma$ is the asymptotic covariance matrix in the Markov chain CLT (different from $\Lambda$ since the samples are correlated).
  3. $p$ is the number of quantities being estimated (or, in this case, the dimension of the posterior).
  4. $|\cdot|$ is the determinant.
  5. [$n$ is the number of samples.]

mESS can be estimated by using the sample covariance matrix to estimate $\Lambda$ and the batch means covariance matrix to estimate $\Sigma$. This has been coded in the function multiESS in the R package mcmcse.

The only nontrivial part in the algorithm is the batch estimation of $\Sigma$, which requires a choice of the batch size $1 \le b_n \le n$.
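To make the role of $b_n$ explicit, here is a minimal R sketch of the estimator (a sketch only: `multi_ess` is a hypothetical name, and for real analyses the `multiESS` function in mcmcse should be preferred):

```r
# Minimal sketch of the mESS estimator, assuming an n-by-p matrix `chain`
# of MCMC samples. Lambda is estimated by the plain sample covariance,
# Sigma by the batch means estimator (samples beyond the last full batch
# are dropped for simplicity).
multi_ess <- function(chain, b_n = floor(sqrt(nrow(chain)))) {
  n   <- nrow(chain)
  p   <- ncol(chain)
  a_n <- floor(n / b_n)                  # number of batches
  Lambda <- cov(chain)                   # sample covariance of the chain
  Y <- matrix(0, a_n, p)                 # row i = mean of batch i
  for (i in seq_len(a_n)) {
    Y[i, ] <- colMeans(chain[((i - 1) * b_n + 1):(i * b_n), , drop = FALSE])
  }
  Sigma <- b_n * cov(Y)                  # batch means estimator of Sigma
  n * (det(Lambda) / det(Sigma))^(1 / p)
}
```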

The default choice for $b_n$ in multiESS is $\lfloor n^{1/2} \rfloor$. However, the method for choosing $b_n$ in the paper is not completely clear and is somewhat arbitrary (e.g., the authors sometimes pick $\lfloor n^{1/3} \rfloor$). Also, the authors show that the choice of $b_n$ should depend on the mixing time of the Markov chain, with slow-mixing processes requiring a larger $b_n$.

In practice, the choice of $b_n$ can affect the output of $\text{mESS}$ by an order of magnitude or even more.

As a practical, conservative choice, I am thinking of using in my analyses the lower bound $\text{mESS}^*$ defined as follows:
$$
\text{mESS}^* \equiv \min_{b_\text{min} \le b_n \le b_\text{max}} \text{mESS}(b_n)$$
where $b_\text{min} \ge 1$ is a small number (e.g., $b_\text{min} \equiv \lfloor n^{1/4} \rfloor$) and $b_\text{max} \le n$ is taken such that there is a reasonable number of batches to average over (e.g., $b_\text{max} \equiv \lfloor n/10 \rfloor$).
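A sketch of this lower bound, reusing the hypothetical `multi_ess()` above (the exhaustive integer grid is just for illustration; a coarser grid would be cheaper for large $n$):

```r
# Scan batch sizes from b_min to b_max and keep the smallest mESS.
multi_ess_star <- function(chain,
                           b_min = floor(nrow(chain)^(1 / 4)),
                           b_max = floor(nrow(chain) / 10)) {
  min(sapply(b_min:b_max, function(b) multi_ess(chain, b)))
}
```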

Does this sound sensible? Any other suggestion on how to choose $b_n$ in an objective way?

(A quick test suggests that $\text{mESS}^*$ is not as conservative as taking the minimum of the univariate ESS across dimensions, so it probably strikes a good balance.)

Best Answer

Good question. FYI, this problem also arises in the univariate setting; it is not specific to the multivariate ESS.

This is the best I can think of right now. Choosing the optimal batch size is an open problem. However, it is clear that, asymptotically, $b_n$ should increase with $n$ (I think this is also mentioned in the paper; in general, it is known that if $b_n$ does not increase with $n$, then the estimator of $\Sigma$ will not be strongly consistent).

So instead of looking at $1 \leq b_n \leq n$ it will be better (or at least theoretically better) to look at $b_n = n^{t}$ where $0 < t < 1$. Now, let me first explain how the batch means estimator works. Suppose you have a $p$-dimensional Markov chain $X_1, X_2, X_3, \dots, X_n$. To estimate $\Sigma$, this Markov chain is broken into batches ($a_n$ batches of size $b_n$), and the sample mean of each batch is calculated ($\bar{Y}_i$).

$$ \underbrace{X_1, \dots, X_{b_n}}_{\bar{Y}_{1}}, \quad \underbrace{X_{b_n+1}, \dots, X_{2b_n}}_{\bar{Y}_{2}}, \quad \dots\quad ,\underbrace{X_{n-b_n+1},\dots, X_{n}}_{\bar{Y}_{a_n}}.$$

The (scaled) sample covariance matrix of these batch means is the batch means estimator of $\Sigma$.
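Concretely, writing $\bar{X}_n$ for the overall sample mean and $a_n = \lfloor n / b_n \rfloor$ for the number of batches, the batch means estimator is
$$ \hat{\Sigma} = \frac{b_n}{a_n - 1} \sum_{i=1}^{a_n} \left(\bar{Y}_i - \bar{X}_n\right)\left(\bar{Y}_i - \bar{X}_n\right)^T, $$
where the factor $b_n$ rescales the covariance of the batch means to the scale of $\Sigma$.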

If $b_n = 1$, then the batch means are exactly the Markov chain, and your batch means estimator will estimate $\Lambda$ and not $\Sigma$. If $b_n = 2$, you are assuming that there is significant correlation only up to lag 1, and that all correlation beyond that is negligible. This is likely not true, since significant lags usually extend well beyond 20-40.

On the other hand, if $b_n > n/2$ you have only one batch, and thus no batch means estimator. So you definitely want $b_n < n/2$. But you also want $b_n$ small enough that you have enough batches over which to calculate the covariance structure.

In Vats et al., I think they choose $n^{1/2}$ when the chain is slowly mixing and $n^{1/3}$ when mixing is reasonable. A sensible approach is to look at how many significant lags you have: if you have large lags, choose a larger batch size, and if you have small lags, choose a smaller batch size. If you want to use the method you mentioned, then I would restrict $b_n$ to a much smaller set. Maybe let $T = \{.1, .2, .3, .4, .5\}$ and take $$ \text{mESS}^* = \min_{t \in T} \text{mESS}\left(\lfloor n^t \rfloor\right). $$
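In code, assuming `chain` is the $n \times p$ sample matrix and reusing the `multi_ess()` sketch from the question, this restricted search is just:

```r
# Restrict the search to b_n = floor(n^t) for t in T.
T_grid <- c(0.1, 0.2, 0.3, 0.4, 0.5)
b_grid <- floor(nrow(chain)^T_grid)
mess_star <- min(sapply(b_grid, function(b) multi_ess(chain, b)))
```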

From my understanding of the field, there is still some work to be done in choosing batch sizes, and a couple of groups (including Vats et al.) are working on this problem. However, the ad hoc way of choosing batch sizes by learning from ACF plots seems to have worked so far.

EDIT:

Here is another way to think about this. Note that the batch size should ideally be such that the batch means $\bar{Y}_i$ have no autocorrelation left in them. So the batch size can be chosen such that the ACF plot for the batch means shows no significant lags.

Consider the AR(1) example below (this is univariate, but can be extended to the multivariate setting). With $\epsilon_t \sim N(0,1)$, $$x_t = \rho x_{t-1} + \epsilon_t. $$

The closer $\rho$ is to 1, the more slowly the chain mixes. I set $\rho = .5$ and run the Markov chain for $n = 100{,}000$ iterations. Here is the ACF plot for the Markov chain $x_t$.

[Figure: ACF plot of the Markov chain $x_t$]
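(For reference, a sketch in R of this experiment; the original code is not shown, so the seed and the use of `arima.sim` are my choices.)

```r
set.seed(1)
n   <- 1e5
rho <- 0.5
# Stationary AR(1) with N(0,1) innovations.
x <- as.numeric(arima.sim(model = list(ar = rho), n = n))
acf(x, main = "ACF of the AR(1) chain")
```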

Seeing that there are significant lags only up to about 5-6, I choose $b_n = \lfloor n^{1/5} \rfloor$, break the chain into batches, and calculate the batch means. Now I present the ACF for the batch means.

[Figure: ACF plot of the batch means for $b_n = \lfloor n^{1/5} \rfloor$]
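(Again a sketch; `batch_means()` is a small helper of mine, written so that other exponents are easy to try.)

```r
# Mean of each consecutive block of b_n samples; column i of the matrix
# holds batch i, so colMeans returns the batch means.
batch_means <- function(x, b_n) {
  a_n <- floor(length(x) / b_n)
  colMeans(matrix(x[1:(a_n * b_n)], nrow = b_n))
}
acf(batch_means(x, floor(n^(1 / 5))), main = "ACF of batch means, t = 1/5")
```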

Ah, there is one significant lag in the batch means. So maybe $t = 1/5$ is too low. I choose $t = 1/4$ instead (you could choose something in between as well, but this is easier), and again look at the ACF of the batch means.

[Figure: ACF plot of the batch means for $b_n = \lfloor n^{1/4} \rfloor$]

No significant lags! So now you know that choosing $b_n = \lfloor n^{1/4} \rfloor$ gives you big enough batches so that subsequent batch means are approximately uncorrelated.
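With the `batch_means()` helper above, this re-check is a one-liner:

```r
acf(batch_means(x, floor(n^(1 / 4))), main = "ACF of batch means, t = 1/4")
```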
