How much precision should we expect in our fits?

First, I should have been more specific: we can decide that any change in floating point operation order should not introduce more variation than the limits of floating point operation accuracy (which can be less than the floating point representation accuracy), and then run chains long enough that the Monte Carlo error is of about the same order. This may require very long chains.
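
To make the floating point point concrete, here is a minimal (non-Stan) illustration: reassociating a sum changes the result at roughly the machine-precision scale, which is the size of variation being discussed here.

```python
# Floating point addition is not associative: reordering operations changes
# the result at roughly machine-epsilon scale (~1e-16 for doubles).
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))        # False
print(abs((a + b) + c - (a + (b + c))))  # ~1.1e-16
```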

For Monte Carlo standard errors see https://arxiv.org/abs/1903.08008; these are computed, e.g., by stansummary in CmdStan and by monitor in rstan. They are reliable as long as there are no divergences (edit: in the @betanalpha divergence-diagnostic sense).
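
As a rough sketch of what those tools report (not the rank-normalized estimators from the paper, just the basic relationship): the Monte Carlo standard error of a posterior mean is the posterior standard deviation over the square root of the effective sample size.

```python
import numpy as np

def mcse_mean(draws, ess):
    """Monte Carlo standard error of the posterior mean, given an ESS estimate.
    A simplified sketch; stansummary (CmdStan) and monitor (rstan) compute
    ESS and MCSE for you."""
    draws = np.asarray(draws)
    return float(np.std(draws, ddof=1) / np.sqrt(ess))
```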

I just wanted to add some detail to @avehtari’s response.

The thing with Markov chains is that if they’re mixing quickly (which is what we hope to accomplish using dynamic HMC in Stan) then they can be very unstable, even when targeting the same distribution. A small change to the seed or to the random number generator used in the transitions will cause two chains starting at the same point to decouple from each other and explore the typical set very differently. At the same time, even small perturbations to the deterministic calculations in each transition can cause the random numbers to be used slightly differently and have the same effect.

This is why we can’t guarantee reproducibility of the exact chain evolution on different machines, OS, compiler, etc. Seemingly trivial differences in the floating point arithmetic implementation, really the exact assembly code that implements composite operators, will cause more than enough perturbation to decouple chains started at the same point. If the assembly is exactly the same then we can expect/require reproducibility of any chain states, but otherwise we have to rely on something else.
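
As a toy (non-Stan) illustration of that decoupling, here are two random-walk Metropolis chains on a standard normal, started at the same point, with a seed change standing in for the perturbation (a floating point difference is harder to stage in a few lines): the individual states decouple immediately, yet the empirical means agree to within Monte Carlo error.

```python
import numpy as np

def rw_metropolis(n_iter, seed, init=0.0, step=2.0):
    """Random-walk Metropolis targeting a standard normal."""
    rng = np.random.default_rng(seed)
    theta, draws = init, np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()
        # log acceptance ratio for N(0, 1): log p(prop) - log p(theta)
        if np.log(rng.uniform()) < 0.5 * (theta**2 - prop**2):
            theta = prop
        draws[i] = theta
    return draws

a = rw_metropolis(20_000, seed=1)
b = rw_metropolis(20_000, seed=2)
print(np.mean(a == b))     # the individual states almost never coincide
print(a.mean(), b.mean())  # but both means are ~0, within MCMC error
```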

In general the only guarantee we have for MCMC is that after an infinite number of iterations empirical estimators converge to the exact expectation values. There is no guarantee about finite time behavior of the empirical estimators, nor is there any guarantee about the behavior of the individual states in the Markov chain. If we have stronger conditions, like geometric ergodicity, then we get guarantees on the qualitative behavior of the empirical estimators in finite time, namely a central limit theorem.
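
To spell out the central limit theorem being referenced: writing the empirical estimator of \mathbb{E}[f] from N correlated draws as

\hat{f} = \frac{1}{N} \sum_{n=1}^{N} f(\theta_{n}),

geometric ergodicity (plus finite \mathbb{E}[f] and \mathbb{E}[f^{2}]) gives, for large N,

\hat{f} \sim \mathcal{N}\!\left(\mathbb{E}[f], \sigma_{\hat{f}}\right), \quad \sigma_{\hat{f}} = \sqrt{\mathrm{Var}[f] / N_{\mathrm{eff}}},

which is where the MCMC standard errors \sigma_{\hat{f}} used below come from.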

So if we have those stronger conditions (we can’t prove them, but we can look for violations of them, which is what diagnostics like Rhat and divergences capture), then we can compare different MCMC outputs by their empirical estimators and corresponding MCMC standard errors. Because both chains have standard errors, a proper comparison has to take advantage of the relationship

\hat{f}_{1} / \sigma_{\hat{f}_{1}} - \hat{f}_{2} / \sigma_{\hat{f}_{2}} \sim \mathcal{N}(0, \sqrt{2})

which holds when the MCMC estimators enjoy central limit theorems.

In practice we could check for

\left| \hat{f}_{1} / \sigma_{\hat{f}_{1}} - \hat{f}_{2} / \sigma_{\hat{f}_{2}} \right| > 5 \cdot \sqrt{2}

although we’d have to be careful about multiple comparisons, which would require adding something like \sqrt{N_{\mathrm{comp}}} to the right-hand side.
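
A minimal sketch of that check in code (hypothetical helper names; the estimates and standard errors would come from something like stansummary or monitor):

```python
import numpy as np

def estimators_disagree(f1, se1, f2, se2, n_comparisons=1, n_sigma=5.0):
    """Flag a pair of MCMC estimators whose standardized difference exceeds the
    n_sigma * sqrt(2) threshold, widened roughly as suggested above when
    several expectations are compared at once."""
    z = abs(f1 / se1 - f2 / se2)
    threshold = n_sigma * np.sqrt(2.0)
    if n_comparisons > 1:
        threshold += np.sqrt(n_comparisons)
    return z > threshold
```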


This question came up again because, in testing the new compiler (which was accidentally being run against the Math library from a week or two ago), I discovered that the Math library’s output had changed for one of our models, by about 2e-8, probably due to changes to gamma (as accessed via inv_gamma). So it can be good to have an answer to this question in general :)

Thanks for laying this out so clearly! So maybe for models we strongly suspect are ergodic (e.g. all of your increasingly-useful stat_comp_benchmarks) we can use the formula you gave relating the estimators and standard errors, with the seemingly-arbitrary (😜) 5 \cdot \sqrt{2} and the adjustment for multiple comparisons. Am I reading that right that \hat{f} is any arbitrary expectation?

I think this might actually become super important for verifying the optimization work that @Matthijs has been doing: there we expect floating point operations to be reordered and changed substantially for the sake of performance or numerical precision, in a way that is mathematically but not floating-point-arithmetically equivalent. So thanks all :)

5 \sqrt{2} is a five-\sigma threshold corresponding to a tail probability of 3 \cdot 10^{-7}. In other words, if we were testing only one expectation, all the ideal conditions held, and the two samplers were returning equivalent results, then the false positive rate for using that threshold would be 3 \cdot 10^{-7}. On average we’d have to run a few million tests before getting a single false positive.
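
For reference, the quoted tail probability is just the one-sided five-\sigma Gaussian tail:

```python
from scipy.stats import norm
print(norm.sf(5.0))  # ~2.9e-7, i.e. roughly the 3e-7 quoted above
```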

The true positive rate would depend on how samplers differed, but if we require reasonably small MCMC standard errors (i.e. hundreds of effective samples) then even small differences would be pretty easy to see.

All of this should hold for the estimated expectation value of any function f, provided that both the true expectation value \mathbb{E}[f] and the second moment \mathbb{E}[f^{2}] are well-defined and finite.

That said, all of this presumes that we don’t know what the true expectation value is and are just trying to compare two samplers. If we know what the true expectation values are, then we can test the MCMC central limit theorem directly. Indeed we can test it for multiple chains and look at the entire distribution of tail probabilities. This allows the false positive rate to be tuned exactly while maintaining high sensitivity to problems.

This was the motivation for the validation written up in the very old feature/stat_valid_test branch of stan-dev/stan on GitHub. It never went anywhere because the validation runs couldn’t be done in pure C++ (we still need in-memory, pure C++ IO).


So just to make sure I understand: if you know the expected values, you can hardcode them in, calculate a standardized value (sample_mean - expected_mean) / se (like this), and then use a Student-t test on that value. But in that code it looks like you were using 0.1, which seems like it would have a false positive rate much higher than once every few million tests :P Am I missing something, or is that code just using a lower bar?

To be more formal: you compute the p-value of the estimated mean using the Student-t CDF, and then you repeat this, looking at the distribution of p-values. The number of p-values below/above a given threshold then follows a binomial distribution, for which you can once again compute tail probabilities and do a null hypothesis significance test.

This was a long time ago but I think that the 0.1 threshold gave a false positive rate of 1 in 10,000 while keeping the sensitivity to failures quite high.
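
For what it’s worth, here is a sketch of that procedure (illustrative names only, not code from the old branch), assuming the true mean is known and an approximate degrees-of-freedom/ESS figure is available for the Student-t:

```python
import numpy as np
from scipy import stats

def run_pvalues(sample_means, true_mean, mcse, dof):
    """Two-sided p-values for each replicated run, via the Student-t CDF."""
    t = (np.asarray(sample_means) - true_mean) / np.asarray(mcse)
    return 2.0 * stats.t.sf(np.abs(t), df=dof)

def binomial_check(pvalues, threshold=0.1):
    """If everything is calibrated, the count of p-values below the threshold
    is Binomial(n, threshold); return the upper-tail probability of the
    observed count for a null hypothesis significance test."""
    pvalues = np.asarray(pvalues)
    k = int(np.sum(pvalues < threshold))
    return float(stats.binom.sf(k - 1, n=pvalues.size, p=threshold))
```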
