Does effective sample size estimation require independent chains?

Hi everyone,

I was wondering if the effective sample size calculation as described in Vehtari et al 2019 could be applied to MCMC sampler whose chains are not formally independent like Goodman & Weare’s Affine Invariant Markov chain Monte Carlo (emcee is a python package that implements this algorithm).

I have been reading the \hat{R}-ESS paper, and there are various references to independent chains, however, they are assumed to be correlated, so it looks like they are not really treated as independent. So my question basically is: does the fact that MCMC chains in HMC/NUTS are formally independent have any relevance for the effective sample size algorithm? Could this algorithm be applied to the results of affine invariant MCMC runs?

I suspect the algorithm is valid for both, and I have done some simple experiments which seem to agree with this statement, but it would be great to have some proper explanations about the applicability conditions of the algorithm.

Here is the link to the complete code for the experiments using emcee, and below there is one comparison between emcee’s autocorrelation time estimation, the initial proposal from Goodman & Weare to calculate autocorrelation and ArviZ implementation:

image

References

  • Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, Paul-Christian Bürkner (2019): Rank-normalization, folding, and localization: An improved \hat{R} for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008.
2 Likes

I think there might be some confusion right there. When one says two chains (X_1)_{n\geq 1} and (X_2)_{n\geq1} are independent one means Pr(X_2^{(i)} = x \mid X_1^{(j)} = y) = Pr(X_2^{(i)} = x) for any i,j, x, y. This says nothing about Pr(X_1^{(i)} = x \mid X_1^{(j)} = y).
In other words, contrary to some other algorithms, the chains in Stan are not coupled, but the states within each chain are still correlated within the chain, i.e., the chains are autocorrelated.

2 Likes

I am most surely confused at some point, however, following the notation of the paper, the autocorrelation time (\hat{\tau}) is calculated from \hat{\rho}_t which takes into account both W (the within chain variance) and \hat{\text{var}}^+ which combines both W and B (the between chain variance).

Also, after having done the splitting, all “half” chains are treated in the same way, even though after splitting, the condition:

Pr(X^{(i)}_2=x∣X^{(j)}_1=y)=Pr(X^{(i)}_2=x) \qquad \forall i,j,x,y

does not hold anymore. Therefore, is that condition really required for the algorithm to be applied? Stan’s chains fulfil it, but other sampler like emcee as I commented before do not which is what motivates my question.

1 Like

I’m sorry I misunderstood your question initially. While I can’t offer a complete answer now (and it would be better for @avehtari to give a definitive statement anyway), one thing to remark is that it does not make much sense to compare chains that to do not target the same distribution. I am not familiar with AI-MCMC, I can comment on other algorithms, such as Metropolis-coupled MCMC, in which the target is tempered and “particles” are exchanged between “cold” and “hot” chains. In this case I argue it would not be sensible to employ \hat{R} as the chains do not have the same target and hence comparing variances is comparing apples to oranges.

1 Like

Thanks for all the comments and help on making the question clear enough! In AI-MCMC, all chains target the same distribution. Quite roughly, its chains are dependent because in order to avoid calculating the derivatives, it uses the information from the other chains to select their movement in the parameter space.

This means that chains are run independently with different random seeds (with that big striding that it’s very unlikely that pseudo random sequences overlap). The iterations within each chain are autocorrelated.

\hat{R} and the ESS for combined chains as presented in the paper are not valid for chains which communicate during the iterations and thus are not independent from each other. You could use \hat{R} for independent runs of such algorithms which have dependent chains within each run, but I don’t know how ESS would be computed as it would need to take in to account between chain dependencies, too.

2 Likes

Thanks @avehtari ! Just one question left then. I hope it is not too trivial. Then, can the first half of a chain and its second half be considered independent too? If not, why can splitting be used before calculating ESS?

Given the autocorrelation goes near zero in time which is much smaller than the number of iterations in each half, then they can be considered independent. In cases where the autocorrelation doesn’t go to zero fast enough, other terms will dominate. Based on the extensive simulations there is a slight benefit in splitting the chains even if we can assume only approximate independence between each half-chain pair (there is balance between estimating big lag autocorrelations accurately and using split chains in case of very long dependencies within chains).

2 Likes

The thing to do is run the algorithms on simulated data and calculate empirical standard (squared) error. That implies an ESS because std-err = std-dev / sqrt(ESS). This is how Aki measured calibration of the new ESS estimator. (You have to run a long time to get good precision for the true std-dev of the postrior.)

What you’ll find is that these ensemble methods don’t scale well with dimensions because interpolation/extrapolation quickly runs outside of the typical set, which looks more like a donut than a ball in most posteriors in high dimensions.

1 Like

Thanks a lot @Bob_Carpenter! I will try it whenever I have time. I checked the algorithm implemented in emcee, and both results converge because they are doing nearly the same. The only two differences are the threshold when estimating \tau = \sum \rho_t and the normalization to get autocorrelations from autocovariances. emcee uses

\rho(t) = \frac{1}{M} \sum^M c_{m}(t)/c_m(0)

where c are autocovariances c(t) = \int (\theta^{(n)}-\mu)(\theta^{(n+t)}-\mu)p(\theta) d\theta) instead of using

\rho(t) = 1- \frac{W - \frac{1}{M} \sum^M c_{m}(t)}{\widehat{var}^+}

what I still do not understand is how is chain dependency taken into account (if it is taken at all). This approach will probably solve this issue (as I suspect it is not), so thanks a lot.

Notes: I have modified slightly the notation in the paper, but I think it will be clear anyways. Also, I took formulas 13 and 14 in appendix A without the \frac{1}{\sigma^2} factor, hence the c, it looks like a typo.

Yes! I know, but thanks for the heads up anyway :). My main concerns are curiosity and documenting properly ArviZ, which also supports emcee, so I think we should say whether the ess function implemented is compatible with emcee results or not.

Thanks everybody for all your help! I will come back in some days after checking ESS calibration in emcee results.

1 Like

I had some time today to run a very simple model. I used a Normal prior p(\theta) = \mathcal{N}(0, 10) and as likelihood p(y|\theta) = \mathcal{N}(\theta, 2). For simplicity I used a single observation y=0 so that the posterior is p(\theta|y=0) = \mathcal{N}(0, \sigma=\frac{10}{\sqrt{26}}).

I sampled a posterior using 4 chains and 700,000 draws with PyMC3 and PyStan, and with emcee, using the same number of draws and 6 and 150 walkers. I plotted the variance of the error in E(\theta) and \sigma^2/\text{ESS} for all 4 sets of posterior samples. I uploaded the code used in case somebody wants to consult it. Here are the results:

image

image

These are emcee results for 150 walkers:

image

and for 6 walkers:

image

Is this what you had in mind? It looks like the ESS does work for emcee, I don’t know why though, any idea? I will try to run some more models on Monday or Tuesday to see if this was a coincidence.

2 Likes

Sounds like a reasonable motivation.

With Stan, we use cross-chain information in our ESS calculations much like we do for R-hat. I’ve read that Goodman and Weare paper, but never understood the ESS part of that, which seems to be based on a piece of software I don’t know.

I’m not that surprised. My guess is that there aren’t very strong correlations across chains. If you get a model that mixes poorly, so you get a lot of cross-chain correlation, you might see different behavior, especially if it hasn’t mixed well yet.