I’m curious if you can impart some high level wisdom about testing probabilistic software. I have two questions:

How do you test code that samples random variables?

I’d like to start implementing some samplers myself in a pet project distributions, but also want to write tests for the fancier random graph sampler fastRG (code).

I pretty much have no clue where to start with this, and am not quite C++ fluent enough to learn by reading the Stan source (also I’m not sure what part of the source to look at).

How do you test for performance regressions?

I’m re-writing some of the slow R portions of the samplers in C++, but want to be sure that this actually results in faster code.

Figured y’all would be good people to ask about this kind of thing.

Hypothesis testing. The only tricky issue is generating tests with the right power to catch the right mistakes and then calculating a threshold to control the false positive rate as we want these tests automated.

If you don’t want to work from first principles, John Cook wrote a nice overview as a chapter in Beautiful Testing called Testing a Random Number Generator.

Stan uses chi-squared tests on marginals with equal probably sized bins. For instance, here’s a test for the bernoulli_rng function:

TEST(ProbDistributionsBernoulli, chiSquareGoodnessFitTest) {
boost::random::mt19937 rng;
int N = 10000;
std::vector<double> expected;
expected.push_back(N * (1 - 0.4));
expected.push_back(N * 0.4);
std::vector<int> counts(2);
for (int i = 0; i < N; ++i) {
++counts[stan::math::bernoulli_rng(0.4, rng)];
}
assert_chi_squared(counts, expected, 1e-6);
}

We compute expected counts of 0s and 1s as N * 0.6 and N * 0.4 for bernoulli(0.4). Then we generate a bunch of random draws, count the number of 0 and 1 draws (using them as indexes), then calls the chi-squared test function and test against 1e-6 as a threshold p-value below which to report an error.

I should’ve also added that if what you want to do is test posterior inference, you want to use simulation-based calibration as described in this paper.

With lots of iterations, you get an empirical distribution for the new system and the old system. You can then do more hypothesis testing, usually with the null hypothesis that the old system is at least as fast as the new system.

Definitely a biggie is simulating data from the model (or subsets of).

Other things that would help are using a random seed, visualization, making sure the deterministic functions are properly implemented by testing specific numerical inputs.

I’ve translated a pretty complex sampler between languages with just these tools.

Also making subsets if your code are sampling fine. Like if you have metropolis within Gibbs go ahead and make sure your metropolis is running fine and is recovering parameters from simulated data.

I see no reason to use hypothesis testing. Really there’s no reason to look at hypotheses testing. At all.

This wouldn’t be efficient, but say we generate a large number of samples, N from a distribution, conditional on parameter(s) \theta. Take an interval on the domain of each distribution [a,b]. Then \int_{a}^b p_1(\theta) \approx \int_{a}^b p_2(\theta) where p_1, p_2 are generators from packages 1 and 2. If we keep taking intervals, and we notice a consistent discrepancy, than we should be worried. If the error if consistently <\epsilon, then we can attribute that to monte carlo error or numerical issues, other wise, we have a problem. It might help to look at a distribution of the residuals.

I ran it by a mathematician, with no political agenda. Instantly, they said you’re showing point wise convergence between two CDFs, showing the two distributions are identical.

Formally, we’re showing that for an arbitrary b in the domain of f_1, f_2 that as N \rightarrow \infty, where N is the number of samples from each distribution |\int_{-\infty}^{b} f_1 - \int_{-\infty}^{b} f_2 | < \epsilon implies pointwise convergence.

There’s one mistake I made in the initial reply, it’s that the integrals should be from [-\infty, b] and not from a

So @alexpghayes, I would go ahead and add this to your statistical computation tool box.

Also, here’s where the tools come from: Rudin’s Principals of Analysis, and if you’re comfortable, Rudin’s Real and Complex Analysis. And any basic probability.

This is not a matter political agenda nor real analytical agenda.

With only finite computation you cannot take the N \rightarrow \infty limit. All you can do in this direction is generate samples from the two generators for a given finite N and then compare the resulting _empirical_cumulative distribution functions. In order to make a formal comparison between the generators themselves you have to take into account the variation of the empirical cumulative distribution functions around the true cumulative distribution functions, which ultimately takes the form of a tail probability of the sampling distribution of some statistic of the two empirical distribution functions. This, for example, is how the Kolmgorov-Smirnov is derived.

For finite N you always have sampling variation, and in order to discriminate between expected variation and actual discrepancies you need to consider tail probabilities and end up with something that looks like a hypothesis test.

The unit tests we have for RNGs in the Stan math library are hypothesis tests where the null hypothesis that the draws being tested come from the appropriate distribution.

We can’t test (-\infty, a) for all a or even infinitely many a, so instead we choose finitely many break points and look at the intervals between them. For example, we can pick 11 points evenly spaced at inverse cdf of [0, 0.1, 0.2, \ldots 0.9, 1.0], and what we expect is that the bins have roughly evenly spaced random values.

As @betanalpha points out, with only finitely many draws we have to leave the pleasant world of asymptotic real analysis proofs and deal with sampling uncertainty. Then we have to control our false-discovery rate. That’s where the quantitative side of the hypothesis testing component comes in.

There’s no frequentist political agenda here. It’s just that (Markov chain) Monte Carlo draws satisfy the required preconditions to make the hypothesis tests well calibrated. So we use them.