Simulation-based calibration (SBC) and thinning

Hi all,

The Stan reference manual states that “it should be emphasized that the only reason to thin a sample is to reduce memory requirements”. This is a common view I see mentioned in the Stan forums.

However, Figure 11 of the Validating Bayesian Inference Algorithms with Simulation-Based Calibration paper shows that even after switching to a non-centered parameterisation, thinning is still needed as part of model diagnostics using SBC.

My question is: suppose we find a thinning value that makes the SBC histogram look uniform. Should we use that same thinning value when doing MCMC sampling for our final model?

If the answer is yes, wouldn’t that contradict the point made in the Stan manual that thinning should only be used to reduce memory requirements?

If the answer is no, why wouldn’t we use thinning for our final model if the SBC histogram indicates that posterior samples have strong autocorrelation?

Many thanks

@betanalpha @hyunji.moon

suppose we find a thinning value to promote a uniform distribution shape on the SBC histogram, should we use that same thinning value when doing MCMC sampling for our final model?

My understanding is yes.

FYI, the SBC package has an option that lets you inspect just the effect of thinning without re-running SBC: the thin_ranks argument of compute_results (documented under “Fit datasets and evaluate diagnostics and SBC metrics” in the SBC package reference).

No need to use the same thinning unless you’re interested in extreme rank statistics. Otherwise more draws will give you better accuracy. SBC is based on examining ranks, including extreme ranks, and those are significantly biased when the draws are correlated. We’ll update a paper next week to mention these issues.

Markov chain Monte Carlo estimation extracts information from every state of a Markov chain, even when those states are correlated. Thinning a Markov chain removes some of those states and hence some of that information, resulting in Markov chain Monte Carlo estimators that are less accurate. If the autocorrelations are large enough, however, then only an infinitesimal amount of information is lost, and the Markov chain Monte Carlo estimator accuracy remains practically the same, in which case nothing is lost by thinning.
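To make that trade-off concrete, here is a small sketch rather than anything from the Stan manual or the paper. It assumes a toy AR(1) process with N(0, 1) marginals as a stand-in for MCMC output; the function names, the chain length of 1000, and the thinning factor of 10 are all made up for illustration. It compares the mean squared error of the chain average with and without thinning, once for weakly and once for strongly autocorrelated chains.

```python
import numpy as np

def ar1_chains(rho, n, reps, rng):
    """Simulate `reps` stationary AR(1) chains of length n with N(0, 1) marginals.

    This is a hypothetical stand-in for MCMC output, not a real sampler.
    """
    x = np.empty((reps, n))
    x[:, 0] = rng.normal(size=reps)
    noise = rng.normal(scale=np.sqrt(1.0 - rho**2), size=(reps, n))
    for i in range(1, n):
        x[:, i] = rho * x[:, i - 1] + noise[:, i]
    return x

def mse_full_vs_thinned(rho, n=1000, thin=10, reps=2000, seed=1):
    """MSE of the chain average (true mean is 0), using all states vs every thin-th state."""
    rng = np.random.default_rng(seed)
    chains = ar1_chains(rho, n, reps, rng)
    full = np.mean(chains.mean(axis=1) ** 2)
    thinned = np.mean(chains[:, ::thin].mean(axis=1) ** 2)
    return float(full), float(thinned)

results = {rho: mse_full_vs_thinned(rho) for rho in (0.3, 0.95)}
for rho, (full, thinned) in results.items():
    print(f"rho={rho}: full MSE={full:.5f}, thinned MSE={thinned:.5f}, "
          f"ratio={thinned / full:.2f}")
```

In this toy setting the thinned estimator should be clearly worse when the autocorrelation is weak, and nearly as good as the full one when the autocorrelation is strong, which is the pattern described above.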

The challenge with thinning is determining when the autocorrelations are large enough that thinning won’t degrade the Markov chain Monte Carlo estimator accuracy significantly, or at least understanding what degradations are worth any benefits gained from having shorter Markov chains with lower memory requirements. The logic in the Stan reference manual is that if one has enough memory to store the full Markov chain without problems then there’s no benefit to thinning and we might as well use all of the information available.

Thinning in the SBC™ method introduced in that paper is for a different purpose. The proof that correct computation yields uniform ranks in that method relies on independent posterior samples with no correlation. Correlations between the posterior samples skew the rank distribution; while it may be possible to account for that skew theoretically, that mathematical analysis has not yet been accomplished. Instead we employed a more heuristic approach: if a Markov chain is sufficiently longer than the autocorrelation length then we can thin that Markov chain enough so that the remaining states are independent to sufficient accuracy that the SBC™ proof applies. It’s very much a hack, but that hack is the best strategy without access to some serious mathematical machinery.
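A rough way to see the skew is the sketch below. It is not the procedure from the paper: it assumes a conjugate toy model (mu ~ N(0, 1), y | mu ~ N(mu, 1), so the exact posterior is N(y / 2, 1 / 2)) and fakes "MCMC" with an AR(1) process that has that posterior as its stationary distribution. All names and settings (`sbc_ranks`, 99 retained draws, rho = 0.98, thinning by 150) are hypothetical choices for illustration.

```python
import numpy as np

def sbc_ranks(rho, n_draws, thin, n_sims, seed):
    """Rank of the simulated mu among n_draws retained posterior draws, per simulation."""
    rng = np.random.default_rng(seed)
    # Simulate from the joint: mu ~ N(0, 1), y | mu ~ N(mu, 1).
    mu_true = rng.normal(size=n_sims)
    y = mu_true + rng.normal(size=n_sims)
    # Exact posterior for this toy model is N(y / 2, 1 / 2).
    post_mean, post_sd = y / 2.0, np.sqrt(0.5)
    # Standardized chain state, initialized from stationarity (no warmup needed).
    z = rng.normal(size=n_sims)
    innov_sd = np.sqrt(1.0 - rho**2)
    ranks = np.zeros(n_sims, dtype=int)
    for step in range(n_draws * thin):
        z = rho * z + innov_sd * rng.normal(size=n_sims)
        if (step + 1) % thin == 0:  # retain every thin-th state only
            ranks += (post_mean + post_sd * z) < mu_true
    return ranks

def extreme_fraction(ranks, n_draws):
    """Fraction of simulations whose rank is the smallest or largest possible value."""
    return float(np.mean((ranks == 0) | (ranks == n_draws)))

n_draws = 99  # ranks take values 0..99, so uniform ranks put about 2% on the two extremes
unthinned = extreme_fraction(sbc_ranks(0.98, n_draws, 1, 2000, 1), n_draws)
thinned = extreme_fraction(sbc_ranks(0.98, n_draws, 150, 2000, 1), n_draws)
print(f"extreme-rank fraction: unthinned={unthinned:.3f}, thinned={thinned:.3f}")
```

Every retained draw here is marginally an exact posterior draw, so any excess mass on the extreme ranks in the unthinned case comes from the correlation alone and not from a wrong posterior.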

The question about how to resolve these two instances of thinning hits on another interesting topic, which is why thinning even works in the first place. This is a technical topic but I’ll do my best to avoid any unnecessary detail.

A Markov chain is a sequence of states generated by a Markov transition \tau. A Markov transition defines a probabilistic map from one state to a range of possible states, and sampling from that transition gives a map from one state to another. Applying this transition operation to an initial state gives a new state, and applying the transition to that new state yields another new state. Iterating generates a Markov chain.

Markov chain Monte Carlo estimates expectation values with respect to a target distribution with empirical averages over the sequential states of a Markov chain. If a given Markov transition interacts well enough with the given target distribution then these estimators are well-behaved; not only do they converge to the exact expectation values as the Markov chains grow to be longer and longer but also we can quantify the accuracy of the estimators for any given Markov chain length.

When we try to formally prove things about thinning we take a given Markov transition \tau and define a new Markov transition \tau_{N} by repeating that component transition N times. In other words applying \tau a total of N times gives the sequence of states x_{i}, x_{i + 1}, \ldots, x_{i + N - 1}, x_{i + N} while applying \tau_{N} once gives the sequence x_{i}, x_{i + N}.
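In code, the construction might look like the sketch below. The particular transition `tau` (an AR(1) step) and the helper names are made up for illustration; the only point is that running \tau_{N} gives the same states as running \tau and keeping every N-th state, provided both consume the same random number stream.

```python
import numpy as np

def tau(x, rng, rho=0.9):
    """A hypothetical Markov transition: one AR(1) step that leaves N(0, 1) invariant."""
    return rho * x + np.sqrt(1.0 - rho**2) * rng.normal()

def make_tau_n(n):
    """Build tau_N: apply tau n times and return only the final state."""
    def tau_n(x, rng):
        for _ in range(n):
            x = tau(x, rng)
        return x
    return tau_n

def run_chain(transition, x0, n_steps, rng):
    """Iterate a transition from x0, returning the initial state plus n_steps new states."""
    states = [x0]
    for _ in range(n_steps):
        states.append(transition(states[-1], rng))
    return np.array(states)

N = 5
# Same seed for both runs, so both chains see identical random numbers.
full = run_chain(tau, 0.0, 20, np.random.default_rng(7))
thinned = run_chain(make_tau_n(N), 0.0, 4, np.random.default_rng(7))

# Thinning the tau chain after the fact matches running tau_N directly.
print(np.allclose(full[::N], thinned))
```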

This might appear unnecessarily complicated but it’s really powerful mathematically. For example we can show that if \tau interacts well with the target distribution (technically if it’s geometrically ergodic) then so too will \tau_{N}. Moreover in that case we can prove that the autocorrelations generated by \tau_{N} are always smaller than the autocorrelations generated by \tau, i.e. by repeating the transition over and over again the resulting states will appear to be less correlated.
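As a sanity check in one special case (not the general proof), consider a Gaussian AR(1) transition, which is an illustrative assumption and not something from the paper. There the autocorrelations of both \tau and \tau_{N} are available in closed form:

```latex
% Assumed toy transition \tau: x_{i+1} = \rho x_i + \sqrt{1 - \rho^2}\,\epsilon_i,
% with \epsilon_i \sim \mathcal{N}(0, 1) independent and |\rho| < 1.
% The stationary distribution is \mathcal{N}(0, 1).
\begin{align*}
  \text{lag-}k \text{ autocorrelation under } \tau:
    &\quad \operatorname{Corr}(x_i, x_{i+k}) = \rho^{k}, \\
  \text{lag-}k \text{ autocorrelation under } \tau_{N}:
    &\quad \operatorname{Corr}(x_i, x_{i+Nk}) = \rho^{Nk}, \\
  \text{and since } |\rho| < 1:
    &\quad |\rho^{Nk}| \le |\rho^{k}| \quad \text{for all } N \ge 1.
\end{align*}
```

So for this toy transition the thinned chain is again AR(1), with \rho replaced by \rho^{N}, and its autocorrelations shrink geometrically in N.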

I believe (the mathematics here are subtle, see for example “A Short Review of Ergodicity and Convergence of Markov chain Monte Carlo Estimators”, arXiv:2110.07032, and I’m not a theorist so no guarantees!) that the converse is also true; if \tau_{N} interacts well with the target distribution then so too will \tau. In this case validating one automatically validates the other.

Okay so we can finally address the question. SBC™ applied to Markov chain Monte Carlo output technically doesn’t diagnose a lack of pathological behavior for \tau but rather for the thinning transition \tau_{N}. If validation of \tau_{N} also implies a validation for \tau then one can use either the full Markov chain or the thinned Markov chain in the subsequent analysis.

Anyways, a really important question that definitely wasn’t addressed carefully in the paper!
