Simulation-based calibration case study with RStan

@avehtari, following up on the short discourse regarding CI’s, just to make sure to see where the analytic “difficulty” is located:

If we denote by \mathcal{C}_k the number of (integer) ranks that are less or equal to k (integer) with 0 \leq k \leq L. Don’t we have for \ell (integer) with 0\leq \ell \leq N, under the (Null)-hypothesis that the N ranks are i.i.d. uniform on the integers \{0,1,\dots, L\}:

\mathbb{P}[\mathcal{C}_k \leq \ell ] = I_{1-\frac{k+1}{L+1}}(N-\ell, 1+\ell)

where N is the number of synthetic datasets generated as part of the SBC algorithm and L the number of (uncorrelated) posterior samples generated, given a synthetic dataset. I is the regularized incomplete beta function. This comes via the cumulative distribution function of a Bionomial with number of trials equal to N and success probability equal to (k+1)/(L+1), evaluated at \ell.

Now, the empirical cumulative distribution function (ECDF) at rank k, denoted by \mathcal{E}_k\in[0,1]\subseteq \mathbb{R}, is distributed as \mathcal{C}_k/N thus

\mathbb{P}[\mathcal{E}_k\leq \frac{\ell}{N}] = \mathbb{P}[\mathcal{C}_k \leq \ell ] = I_{1-\frac{k+1}{L+1}}(N-\ell, 1+\ell)

In theory we could use this to get the intended CI, given the null hypothesis of N ranks uniformly distributed over the integers {0, 1,\dots,L}:

So, for each 0\leq k \leq L we need to find integers \ell_k \geq \ell_k' such that (say)

\mathbb{P}[\frac{\ell'_k}{N}\leq \mathcal{E}_k] = 0.05 \Leftrightarrow \mathbb{P}[\frac{\ell'_k}{N}\geq \mathcal{E}_k] = 0.95

and

\mathbb{P}[\frac{\ell_k}{N}\geq \mathcal{E}_k] = 0.05

to get a 90\% CI. (For each k): So for the first case, couldn’t we start at \ell_k'=N and stepwise decrease \ell_k' by one, until we find that \mathbb{P}[\frac{\ell'_k}{N}\geq \mathcal{E}_k]\leq 0.95? Similar, for the second case, we would start at \ell_k=0 and increase by one, until the first time we have \mathbb{P}[\frac{\ell_k}{N}\geq \mathcal{E}_k] \geq 0.05?

Thanks for the very clear presentation

It seems you are not certain of this? Well, I think you are right on what you present.

cumulative distribution function of the beta distribution

and the envelope is using beta distribution, but we were just ignoring discreteness C_k/N

The difficult part comes when want examine all k at the same time. Your derivation gives the interval for one k independently, but C_k’s are not independent. It would be great to have non-simulation based solution which would take into account the dependency between C_k’s. Due to the existence of simulation based approaches like in Aldor-Noiman paper, I’ve assumed that there is no easier solution.

I have question.
Is the following code of SBC available for an arbitrary stanfit object ?

I want to run the SBC for my fitted model object fit of class stanfit created with data dataList

I am not sure what is ‘M’ and what is suitable M and init_r
But I think the code like as follows;

sbc(
fit@stanmodel,
dataList,
M=11 # I am not sure, this is the sample size of pseudo uniform distributuion ?
)

It will work for any Stan program that adheres to the SBC conventions listed in the helpfile and the vignette. M is the number of times to draw from the prior distribution and so it should be much more than 11.

1 Like

Thank you for reply. I will try to use your SBC program.

I had also tried to implement SBC, but drawing step, I think my seed selection is not so mixed and cause the non-uniformity, or, my model contains actual bias.

It’s best to control the seed (or chain ID) directly. The SBC.R code does this as follows (where S is the variable used for the seed) in the code:

    S <- seq(from = 0, to = .Machine$integer.max, length.out = M)[m]

I’m not an R expert, but that looks to me like a roundabout way to compute

  S <- m - 1
1 Like

If M were 5, then

seq(from = 0, to = .Machine$integer.max, length.out = 5)

yields

[1]          0  536870912 1073741824 1610612735 2147483647

2 Likes

Thank you for letting me know the R scripts for the random seed selection !! Using your suggesting code, I will try to write SBC algorithm.

Thank you for letting me know. I learn a lot form you many times.

Thanks. I should’ve run it rather than guessing. I’m very bad at guessing what R functions are going to do.

Is there a reason not to use 0:4 instead of these highly spread out numbers? Do seeds that are closer somehow windup generating draws that are closer? My ignorance of PRNGs runs deep!

1 Like

If your data block has something like

transformed data {
  real mu_ = std_normal_rng();
  real sigma_ = exponential_rng();
  vector[N] y;
  for (n in 1:N) y[n] = normal_rng(mu_, sigma_);
}

Then my intuition is that if you started one chain with a seed of 0 and another with a seed of 1, then the y realizations would be the same except offset by one index position. So, that is why I went with maximal spacing of the seeds.

2 Likes

Generally pseudo random number generators don’t work like that. That would indicate that the sequence would repeat after .Machine$integer.max steps. The better prngs have much more states than .Machine$integer.max and integer seeds represent only small portion of those states and those states are very far away. Stan uses ecuyer1988, which has length of cycle 2^{61} and .Machine$integer.max is 2^{31}, so there are plenty of pseudo random numbers between each seed.

Mathworks did some carefully analysis years ago choosing their default prng, which is now Mersenne twister. Mersenne twister mt19937 is also what boost recommends as the first choice. It has length of cycle 2^{19937}-1. Use of ecuyer1988 in Stan has been questioned before Quality of Pseudo-Random Number Generators, but for some reason only other than boost library solutions (e.g. mt19937) were considered https://github.com/stan-dev/stan/issues/403
mt19937 would be also 13 times faster than ecuyer1988, but I guess speed has not been a problem.

1 Like

Mersenne is the typical default for most implementations for the reasons that @avehtari mentions, despite its relatively large state.

My understanding of why we went with ecuyer1988 instead of the twister is that it can be strided – offsets in the seed yield guarantees on how independent the resulting sequences appear. This is why we have chain_id – the actual seed used for each chain is seed + chain_id so that each chain will have independent PRNGs, at least within the first cycle.

For something like SBC we only need seed variation amongst chains – the changing data will ensure variation in the posterior samples from iteration to iteration.

If we wanted to consider moving beyond ecuyer1998 then it would be worthwhile to look into http://www.pcg-random.org which allows for PRNGs that can be strided (i.e. that admit multiple streams) with relatively small state and fast execution and configurable periods.

2 Likes

That was the reasoning, specifically because @bgoodri was concerned about using proper parallel RNGs (ones that can be strided ahead efficiently).

Neither speed nor randomness seem to be a problem with what we have. The other candidate that @bgoodri was recommending was

The only reason we never got around to adding it was that it was an extra dependency and the one we have seems to work fine. If we do need to change RNGs, it won’t be a big deal as it has a pretty simple interaction with the rest of the code.

In case someone would like to have a parallel version of Bob’s SBC function, I copied from the rstanarm implementation of sbc() (the latter does not support the automatic adaption of thinning based on n_eff while ensuring a fixed number of output samples). Here we go:

sbc_parallel <- function(model, data = list(),
                sbc_sims = 1000, stan_sims = 999,
                init_thin = 4, max_thin = 64,
                target_n_eff = 0.8 * stan_sims) {
    num_params <- num_monitored_params(model, data) 
    ranks <- matrix(nrow = sbc_sims, ncol = num_params) 
    thins <- rep(NA, sbc_sims)
    sbc_sim <- function(n) {
        seed <- seq(from = 0, to = .Machine$integer.max, length.out = sbc_sims)[n]
        n_eff <- 0
        thin <- init_thin 
        ranks <- rep(NA,num_params) #matrix(nrow = sbc_sims, ncol = num_params) 
        while (TRUE) {
            fit <- sampling(model,
                            data = data,
                            seed=seed,
                            chains = 1,
                            iter = 2 * thin * stan_sims,
                            thin = thin,
                            control = list(adapt_delta = 0.99), 
                            refresh = 0)
            fit_summary <- summary(fit, pars = c("lp__"), probs = c())$summary 
            n_eff <- fit_summary["lp__", "n_eff"]
            if (n_eff >= target_n_eff || (2 * thin) > max_thin) break;
            thin <- 2 * thin
        }
        
        lt_sim <- rstan::extract(fit)$I_lt_sim
        for (i in 1:num_params)ranks[i] <- sum(lt_sim[ , i]) + 1
        c(thin,ranks)
    }
    #rslt <- lapply(1:sbc_sims, sbc_sim)
    rslt <- parallel::mclapply(1:sbc_sims, sbc_sim)
    for (n in 1:sbc_sims) {
        thins[n] = rslt[[n]][1]
        ranks[n,] = rslt[[n]][-1]
    }
    pval <- rep(NA, num_params) 
    for (i in 1:num_params)
        pval[i] <- test_uniform_ranks(ranks[ , i],
                                      max_rank = stan_sims + 1)
    list(rank = ranks, p_value = pval, thin = thins) 
}

Moreover I also copied the seed specification from @bgoodri, albeit I understand from the discussion above that the “max-spacing” approach for seeds might not be needed (so we could have set seed=n above as well, I guess, or leave it as it is in Bob’s original sbc implementation).

3 Likes

Condition n_eff >= target_n_eff has the problem that it doesn’t recognize antithetic chains, which should be thinned, too. Instead of having target_n_eff = 0.8 * stan_sims, it would be better to have, e.g., n_eff_reltol=0.2 and then condition (n_eff/stan_sims >= 1-n_eff_reltol && n_eff/stan_sims <= 1+n_eff_reltol)

2 Likes

Thanks Aki, didn’t think about this, I guess implicitly I assumed antithetic chains would not cause any problems in SBC.

This is one reason why the sbc function returns all the draws and then you can choose the thinning interval afterward when you do the plots.

1 Like

For completeness, here the corresponding parallel version that also works with antithetic chains:

sbc_parallel_antithetic <- function(model, data = list(),
                sbc_sims = 1000, stan_sims = 999,
                init_thin = 4, max_thin = 64,
                n_eff_reltol=0.2) {
    num_params <- num_monitored_params(model, data) 
    ranks <- matrix(nrow = sbc_sims, ncol = num_params) 
    thins <- rep(NA, sbc_sims)
    sbc_sim <- function(n) {
        seed <- seq(from = 0, to = .Machine$integer.max, length.out = sbc_sims)[n]
        n_eff <- 0
        thin <- init_thin 
        ranks <- rep(NA,num_params) #matrix(nrow = sbc_sims, ncol = num_params) 
        while (TRUE) {
            fit <- sampling(model,
                            data = data,
                            seed=seed,
                            chains = 1,
                            iter = 2 * thin * stan_sims,
                            thin = thin,
                            control = list(adapt_delta = 0.99), 
                            refresh = 0)
            fit_summary <- summary(fit, pars = c("lp__"), probs = c())$summary 
            n_eff <- fit_summary["lp__", "n_eff"]
            if ((n_eff/stan_sims >= 1-n_eff_reltol && n_eff/stan_sims <= 1+n_eff_reltol) || (2*thin) > max_thin) break;
            thin <- 2 * thin
        }
        
        lt_sim <- rstan::extract(fit)$I_lt_sim
        for (i in 1:num_params)ranks[i] <- sum(lt_sim[ , i]) + 1
        c(thin,ranks)
    }
    rslt <- parallel::mclapply(1:sbc_sims, sbc_sim)
    for (n in 1:sbc_sims) {
        thins[n] = rslt[[n]][1]
        ranks[n,] = rslt[[n]][-1]
    }
    pval <- rep(NA, num_params) 
    for (i in 1:num_params)
        pval[i] <- test_uniform_ranks(ranks[ , i],
                                      max_rank = stan_sims + 1)
    list(rank = ranks, p_value = pval, thin = thins) 
}
1 Like

I’d suggest looking at @bgoodri’s implementation in RStan. Mine was more meant for tutorial purposes to explain how SBC works, not be the engine people use to actually do it in practice!