Question about the Reproducibility of Stan Results

I’m in the process of writing up replication documentation for a manuscript and had some questions about the reproducibility section of the manual here: 20 Reproducibility | Stan Reference Manual (mc-stan.org)

The way it is written in the reference manual makes it sound as if the results of an analysis may be wildly different even if the models are run under the same versions of Stan/cmdstan, R, and brms used for the original analysis. I was under the impression that differences attributable to slight variation in floating point precision across modern systems shouldn’t really amount to more than a rounding difference, and certainly shouldn’t lead to fundamentally different results, so I’m wondering if I’m mistaken about that?

The way I currently have it documented in my appendix is:

Note: All analyses presented herein were performed on a Windows 10 desktop computer with a 12-core Ryzen 9 5900X CPU, 128 GB of DDR4 memory, and an Nvidia RTX 3080 Ti GPU using cmdstan 2.8.2 as a backend via cmdstanr version 0.4.0.90001 and brms version 2.16.4. Due to the stochastic nature of Stan’s HMC algorithm and system-specific floating point precision, reproducing the analysis herein requires installation of the versions used to fit the original models. Exact reproducibility is only possible if precise system and environment requirements are met. However, results should generally match within a rounding difference. See the reproducibility section of the Stan reference manual for further details.

and I’m wondering if this is correct/sufficient or if I need additional details?
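For reference, the version numbers in that note are captured at the end of the analysis script with something like this (a minimal sketch; the package list is trimmed to the ones named above):

```r
# Record the software versions cited in the appendix note. Run at the end of
# the analysis script so the versions reflect the environment the models were
# actually fit in.
library(cmdstanr)

cat("CmdStan:  ", as.character(cmdstan_version()), "\n")
cat("cmdstanr: ", as.character(packageVersion("cmdstanr")), "\n")
cat("brms:     ", as.character(packageVersion("brms")), "\n")

# Full R session details (R version, OS, attached packages).
sessionInfo()
```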

@andrewgelman @betanalpha @avehtari

I don’t think that there’s any strong guarantee that results should match within a rounding difference (though it might be quite likely that they will), because Stan’s algorithm uses thresholds to decide what to do, and if a rounding difference happens to push you across an important threshold, you could end up with differences in output that look much larger than just rounding errors.

For example, given a simulated Hamiltonian trajectory, Stan has to select which point along that trajectory to transition to. At its core, it does this by comparing RNG output to differences in log-probability at points along the trajectory. It’s conceivable (though again, not necessarily likely) that a rounding difference could result in a different point getting selected, and then everything that happens from that point on will differ by much more than rounding error.
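As a toy illustration (this is not Stan’s actual selection code, just a sketch of the thresholding issue), a last-digit difference in a log-probability can flip a decision like this:

```r
# Toy illustration only -- not Stan's implementation. A thresholded decision
# can flip when the compared quantity differs by less than rounding error.
log_u <- log(0.5)                       # pretend this is the shared RNG comparison value

log_p_diff_sys1 <- -0.693147180559945   # log-probability difference on "system 1"
log_p_diff_sys2 <- -0.693147180559946   # same quantity on "system 2", off in the last digit

# A rule like "select the new point if log(u) < log-probability difference"
# disagrees across the two systems, so everything downstream differs.
c(sys1 = log_u < log_p_diff_sys1, sys2 = log_u < log_p_diff_sys2)
#>  sys1  sys2
#>  TRUE FALSE
```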

Stan also uses thresholds to decide if the NUTS criterion has been reached and to decide whether a divergence has been encountered.

Another way for rounding differences to yield thresholded differences in behavior is if a rounding difference makes the difference between encountering a numerical problem (e.g. an infinite gradient) or not. Even if this occurs early during warmup and doesn’t compromise inference, it would lead to different numbers downstream during fitting.

3 Likes

Alright, I based the rounding-difference statement on results for the same model from another system with entirely different specifications (an Intel i9 9900K CPU instead of an AMD one, different memory modules, and an RTX 3090 GPU), and the results were identical down to the second significant figure. But it seems that may not generalize to other systems?

Aside from setting the RNG seed directly in every function (which I’ve done), is there any way to maximize reproducibility of results across systems (down to one or two significant figures or so)? For example, if a logistic regression coefficient estimate is 1.312 on one system and 1.255 on another, that’s not a substantively meaningful difference. Something like 1.312 versus 0.62 for the same model across two systems that differ only in their hardware specifications would be a lot more problematic.
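For reference, this is roughly how I’m setting the seeds (the formula, data, and seed value here are placeholders, not the actual models from the manuscript):

```r
library(brms)

# Placeholder formula/data/seed -- not the actual analysis, just the pattern.
fit <- brm(
  y ~ x + (1 | group),
  data    = analysis_data,
  family  = bernoulli(),
  backend = "cmdstanr",
  chains  = 6,
  warmup  = 3000,
  iter    = 8000,        # 3k warmup + 5k post-warmup per chain
  cores   = 6,
  seed    = 1234         # fixes the sampler's RNG stream
)

# Any post-fitting step that draws random numbers gets an explicit seed too.
set.seed(1234)
post <- as_draws_df(fit)
```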

Even if we don’t set RNG seeds or worry about platform details, we don’t expect quantities like coefficient estimates to differ by much. Or rather, if such quantities differ by a lot it either means that the effective sample sizes are too low (such that the MCMC standard error is high) or that at least one of the runs did not reach proper convergence (which we hope we would pick up with convergence diagnostics, though there’s never a guarantee). So as long as the algorithm is working properly to fit the model, we don’t expect big differences in the inference. But that’s not the same thing as reproducibility in the numerical values of the individual MCMC samples.
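Concretely, the relevant quantities are reported alongside the estimates; for instance with the posterior package (a sketch, assuming a brms or cmdstanr fit object called fit):

```r
library(posterior)

# If mcse_mean is small relative to sd, rhat is near 1, and the ESS values are
# healthy, two independent runs should agree far more closely than any
# substantively meaningful difference.
draws <- as_draws_df(fit)   # `fit` is the fitted model object (assumed name)
summarise_draws(draws, "mean", "sd", "mcse_mean", "rhat", "ess_bulk", "ess_tail")
```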

I would strongly suggest that if your posterior medians or means are as different as 1.31 versus 1.26, then the underlying streams of samples very probably look completely different from one another, and something has gone fundamentally wrong with your attempt to ensure strong-sense reproducibility. That is, robust inference should be reproducible even if you make no attempt to set the same seeds and reproduce the RNG streams. Once we get into worrying about the RNG seeds, we are seeking a stronger form of reproducibility, for which we have to worry about little details that are unimportant to the gist of our inference.

5 Likes

Okay, that makes sense. Convergence diagnostics on these all check out and the effective sample sizes are probably more than sufficient (all are in excess of 20,000, based on 3k warmup iterations and 5k post-warmup iterations across six chains, i.e. 30k post-warmup draws in total, which is probably overkill, tbh). If inferences are based on coefficient estimates/predicted probabilities computed from a subset of the total samples, and repeated random draws of ~3,000 from the posterior don’t produce substantively meaningful differences in the resulting estimates between systems, does it necessarily matter whether the individual MCMC samples are identical?
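(For what it’s worth, the repeated subsetting I mean is roughly this sketch, with fit standing in for each fitted model and b_x for a coefficient of interest:)

```r
library(posterior)

set.seed(2022)                             # placeholder seed for the subsampling itself
draws <- as.data.frame(as_draws_df(fit))   # `fit` stands in for one of the fitted models

# Repeatedly take ~3,000 random post-warmup draws and summarize a coefficient;
# if these summaries barely move across subsets (and across systems), the
# inference doesn't hinge on the identity of individual MCMC samples.
replicate(20, {
  sub <- draws[sample(nrow(draws), 3000), ]
  mean(sub$b_x)                            # `b_x` is a placeholder parameter name
})
```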

It shouldn’t in general matter to your inference, but it is what people are talking about when they discuss strict reproducibility and worry about setting RNG seeds. It is a useful hedge against the possibility that the algorithm manifests some kind of weirdness or you run into an atypical part of the RNG stream that gives you an atypical set of MCMC samples. At least you’ll be able to reproduce what happened.

2 Likes

I just wanted to complement the discussion with some formal statements.

Markov chain Monte Carlo, including Hamiltonian Monte Carlo, is a stochastic algorithm. In theory, each time we run the algorithm we should generate a different Markov chain and hence different Markov chain Monte Carlo estimators. Assuming that the Markov chain Monte Carlo algorithm is reasonably well-behaved, these estimators will converge to the same answer only for infinitely long Markov chains. In other words, even in theory, realized finite Markov chains, and the subsequent estimators, are not reproducible!

Although the finite-length behavior isn’t deterministically reproducible, for especially well-behaved Markov chain Monte Carlo algorithms it can be characterized probabilistically using a Markov chain Monte Carlo central limit theorem. This is why we have to check so many diagnostics: the useful diagnostics indicate obstructions to that central limit theorem, and hence to our ability to quantify the behavior of expectation value estimators from finite-length Markov chains.

What does this say about reproducibility? Let’s say that we have two realized Markov chains, generated either from the same algorithm or from different algorithms. If all of the algorithms satisfy Markov chain Monte Carlo central limit theorems, then the difference between the Markov chain Monte Carlo estimators for the expectation value \mathbb{E}_{\pi}[f] is approximately normally distributed,

\hat{f}_{1} - \hat{f}_{2} \sim \text{normal}(0, \sqrt{ \text{MCMC-SE}_{1}[f]^{2} + \text{MCMC-SE}_{2}[f]^{2} } ),

where \text{MCMC-SE} refers to the Markov chain Monte Carlo standard errors for the estimates. These are estimated along with the Markov chain Monte Carlo estimators themselves; for example, in Stan they are displayed in the print/stansummary function outputs. In summary, Markov chains, and the subsequent Markov chain Monte Carlo estimators, are not deterministically reproducible, but we can quantify their variation probabilistically.
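For example, a quick check of whether two runs agree to within that expected variation might look like this (a sketch; the estimates and standard errors are illustrative numbers standing in for whatever the two summaries report):

```r
# Compare two Markov chain Monte Carlo estimates of the same expectation value
# using their reported MCMC standard errors (illustrative numbers only).
est1 <- 1.312; se1 <- 0.004   # estimate and MCMC-SE from run 1
est2 <- 1.305; se2 <- 0.005   # estimate and MCMC-SE from run 2

z <- (est1 - est2) / sqrt(se1^2 + se2^2)
z   # |z| much larger than ~3 suggests more than Monte Carlo variation is at play
```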

All of these considerations are in theory. In practice the stochastic generation of Markov chains is determined by a pseudo-random number generator, which itself is initialized with a seed (although most seeds are actually pretty bad; see for example the discussion in Section 2.2.2 of Rumble in the Ensemble). If the computer hardware, operating system, and code are all fixed, then the pseudo-random number generator output, and hence the generated Markov chains, should also be fixed and exactly reproducible.
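For example (a sketch with cmdstanr; the model file and data are placeholders), fixing everything, including the seed, on a single machine should reproduce the draws exactly:

```r
library(cmdstanr)

# Placeholder model/data -- the point is only that on one fixed machine and
# software stack, the same seed yields the same chains bit for bit.
mod  <- cmdstan_model("model.stan")
fit1 <- mod$sample(data = stan_data, seed = 1234, chains = 4, refresh = 0)
fit2 <- mod$sample(data = stan_data, seed = 1234, chains = 4, refresh = 0)

identical(fit1$draws(), fit2$draws())   # expected TRUE on a fixed system
```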

Even small changes to the hardware, operating system, or code, however, can lead to substantial changes in the pseudo-random number generator output and the Markov chain output. Indeed, the better designed the pseudo-random number generator and Markov chain Monte Carlo algorithm, the more sensitive the outputs will be to small changes! In practice all of these changes are essentially equivalent to changing the pseudo-random number generator seed. If we can’t control everything then we effectively cannot control the seed, and all we can expect in terms of reproducibility is the probabilistic variation afforded by the Markov chain Monte Carlo central limit theorem (and then only if the central limit theorem actually holds!).

1 Like