SBC for the benchmark models

Hi all,

This follows from an earlier post here: SBC for arma model

I’ve been looking to use the 12 benchmark models as a means to validate new components in Stan. As a starting point, I have been doing SBC for the models using the current version of Stan.

The idea is to test things like SMC-Stan (which is currently a fork of CmdStan). I’ve therefore written some Python code which works with CmdStan to generate SBC histograms. The code, along with SBC-versions of the benchmark models, can be found here: https://github.com/PhilClemson/pySBC

The code works by running a number of chains determined by the number of bins and the expected bin count. The effective sample size (ESS) is calculated for each chain separately (I think this makes sense, but happy to discuss), and the chains are then re-run with the same random seeds but with appropriate thinning based on the ESS. There's also some error handling to account for things like random seeds for which the chain doesn't successfully initialise, as well as timeouts for chains that are taking a long time.
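To make the thinning step concrete, here is a minimal sketch of the logic (not the exact pySBC code, and it assumes arviz for the ESS computation; pySBC's own ESS routine may differ):

import numpy as np
import arviz as az

def thinning_for_chain(draws, target_ess):
    # draws: 1-D array of posterior draws for one parameter from one chain.
    # target_ess: number of roughly independent draws to keep, e.g. the
    # number of SBC bins times the expected bin count.
    ess = float(az.ess(draws[None, :]).x)          # ESS of this single chain
    thin = max(1, int(np.ceil(len(draws) / ess)))  # ~1 kept draw per effective draw
    # the re-run with the same seed then needs thin * target_ess iterations
    return thin, thin * target_ess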

Results

The following models gave the expected uniform histograms.

eight_schools

garch

low_dim_corr_gauss

pkpd

The gp models also gave mostly uniform histograms, except for the rho parameter.

gp_regr

gp_pois_regr

However, some of the models had non-uniform histograms, seemingly indicating bias in the model / algorithm (this includes the arma model from the previous post).

irt_2pl

low_dim_gauss_mix

low_dim_gauss_mix_collapse

For the latter two models I tried running for 10000 sampling iterations (vs 1000), but I didn't see much improvement.

low_dim_gauss_mix (10000 samples)

low_dim_gauss_mix_collapse (10000 samples)

For the remaining two models (arK and sir) I wasn’t able to perform SBC in a reasonable time frame due to the tight constraints in the models and the difficulty in finding good random seeds.

Questions

I guess the main request would be for someone like @avehtari or @hyunji.moon to confirm that I'm doing SBC correctly. For example, I've had some questions about the ESS criteria previously (see Effective Sample Size in SBC), so it would be good to check that I'm not doing anything silly there.

If that looks good then it might also be worth checking the simulation code for the models that seem to fail SBC. @martinmodrak already spotted a mistake in the original arma_sbc code, so there might be others lurking in those files (I apologise in advance if there are any as obvious as that one!).

Also happy to discuss any other ideas people might have!


I think the effective sample size (ESS) should be computed for all chains combined, as in the loo implementation.

Considering the purpose of SBC, which is to test the model and computation tool as a whole, it might be better not to treat each chain separately. For instance, SBC1 tests your {AR model, HMC with four chains (sampling function)} and SBC2 tests {AR model, variational inference (optimizing function)}. Ideally, if full convergence is guaranteed, the number of chains has little effect on the inference (i.e. the sampled parameters), but if this is not the case, HMC samplers with different numbers of chains should be treated as different computation tools. In this case, the ESS, which represents the efficiency of your computation tool, should be the combined value over all chains. Figure 13 from this paper, which shows one ESS value for different numbers of chains, is one example.

I get the following message for most models with your readme example code; do you know what might be the cause?

Chain 1 encountered errors. Re-running with a different seed.
Chain 2 encountered errors. Re-running with a different seed.
Chain 3 encountered errors. Re-running with a different seed.
Chain 4 encountered errors. Re-running with a different seed.
...

Thanks @hyunji.moon, this has really helped my understanding! I've been stuck in the mindset of doing everything on an individual-chain basis because (at least in CmdStan) each chain generates its own data based on the random seed. However, I've since realised that you can generate the simulated data in a separate Stan file and then feed it into the different chains via a data file. I'll have a go at implementing this multiple-chain approach.
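For anyone following along, the two-stage workflow I have in mind looks roughly like this (a sketch using cmdstanpy rather than my raw CmdStan wrapper; the file names arma_gen.stan / arma.stan and the y_sim variable are placeholders):

import json
from cmdstanpy import CmdStanModel

# Stage 1: draw one simulated dataset from the priors in a separate Stan file.
gen = CmdStanModel(stan_file="arma_gen.stan")
fit_gen = gen.sample(fixed_param=True, chains=1, iter_sampling=1, seed=1234)
y_sim = fit_gen.stan_variable("y_sim")[0]

with open("sim_data.json", "w") as f:
    json.dump({"T": len(y_sim), "y": y_sim.tolist()}, f)

# Stage 2: fit the same dataset with several chains, so the ESS can be combined.
model = CmdStanModel(stan_file="arma.stan")
fit = model.sample(data="sim_data.json", chains=4, seed=1234)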

This is a general message for whenever the process for a chain exits without the normal return code, which can happen e.g. when initialisation fails. However, it looks like most of these messages were due to the random seed value being too large because of a recent change I made (it turns out the maximum value is that of a 32-bit signed int, rather than unsigned!). I've fixed this in the latest commit.
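For the record, the fix amounts to something like this (a minimal sketch):

import numpy as np

rng = np.random.default_rng()
# CmdStan parses the seed as a 32-bit signed int, so stay below 2^31.
seed = int(rng.integers(1, 2**31 - 1))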


Some of the posteriors you use are multimodal. If the individual chains don't mix between the modes (i.e. each is stuck in one mode), you will get good ESS when computing it for individual chains but bad ESS when you compute it jointly over many chains stuck in different modes. If the chains are stuck in different modes, the probability that a chain ends up in a given mode is unlikely to match the relative posterior masses of the modes, and SBC fails. Thus you need to check that your sampling is mixing well between the modes for SBC to provide useful results.
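A quick way to see this numerically (a toy sketch, assuming arviz for the ESS computation):

import numpy as np
import arviz as az

rng = np.random.default_rng(0)
# Two chains stuck in different modes of a symmetric bimodal posterior.
chain_a = rng.normal(-5.0, 1.0, size=1000)
chain_b = rng.normal(+5.0, 1.0, size=1000)

print(float(az.ess(chain_a[None, :]).x))              # looks healthy on its own
print(float(az.ess(chain_b[None, :]).x))              # looks healthy on its own
print(float(az.ess(np.stack([chain_a, chain_b])).x))  # collapses for the joint chains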

In gp_regr I would check the sensibility of the prior compared to the data. If the prior includes rho values that correspond to length scales smaller than the distances between data points or larger than the range of the data, then the likelihood is not informative about rho and there can be computational problems.
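One way to check this (a hypothetical sketch; the prior below is a placeholder, not the actual gp_regr prior):

import numpy as np

x = np.linspace(0, 10, 11)  # stand-in for the observed inputs
rho_prior = np.random.default_rng(1).gamma(shape=4.0, scale=1.0, size=10000)

min_gap = np.min(np.diff(np.sort(x)))  # below this, the data cannot inform rho
data_range = x.max() - x.min()         # above this, the GP is nearly constant over the data
frac = np.mean((rho_prior < min_gap) | (rho_prior > data_range))
print(f"{100 * frac:.1f}% of prior draws fall where the likelihood is uninformative")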

The revised SBC paper suggests computing ESS for different quantiles of each parameter and then choosing just one thinning value for all parameters.
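A rough sketch of that suggestion (one way to implement it, assuming arviz for the ESS computation):

import numpy as np
import arviz as az

def thin_from_quantile_ess(draws_by_param, probs=(0.05, 0.25, 0.5, 0.75, 0.95)):
    # draws_by_param: dict of parameter name -> array of shape (chains, draws).
    # For each parameter and quantile q, compute the ESS of the indicator
    # I(theta <= q), then thin by the worst (smallest) ESS found.
    worst_ess = np.inf
    for draws in draws_by_param.values():
        for p in probs:
            q = np.quantile(draws, p)
            indicator = (draws <= q).astype(float)
            worst_ess = min(worst_ess, float(az.ess(indicator).x))
    n_total = next(iter(draws_by_param.values())).size
    return max(1, int(np.ceil(n_total / worst_ess)))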


Thanks, these are good points. I especially hadn’t thought about the multimodality issue.

Just so I fully understand: do you mean using one thinning value for all simulations, based on the minimum ESS observed over the quantiles? I only ask because different posteriors could give rise to different minimum ESS values. A higher-than-necessary ESS doesn't change the result of SBC, but it does have an impact on computational cost, which I'm keen to keep as low as possible! :)

One thinning value for all parameters of one posterior. You could also use different thinning values for different parameters of one posterior, but that is more work.
