Different results between RStan and BayesianTools SMC for a non-latent-state model

I am currently comparing results obtained with RStan and BayesianTools for a model that does not involve latent states, and I observe systematic differences between the two when using SMC-based inference. The model is a static parameter model with Gaussian priors (including a multivariate normal block) and a beta-binomial observation likelihood. In the RStan implementation, inference is performed using NUTS, while in BayesianTools the same log-likelihood and log-prior are supplied via createBayesianSetup() and sampled using runSMC(). Despite matching the parameterization, prior specifications, and likelihood structure, posterior summaries from BayesianTools SMC differ noticeably from those obtained with RStan, and the discrepancy persists even when the number of particles is increased.

I would like to ask whether such differences are expected for static models, or whether there are default tempering, scaling, or other internal SMC mechanisms in BayesianTools that could lead to a posterior target differing from that of a standard MCMC implementation.

model {
  // Priors
  // theta_sm ~ normal(0, 5);
  // theta_st ~ normal(0, 5);
  // theta_0_init ~ normal(0, 2);
  // theta_0_0 ~ normal(0, 2);
  // phi ~ exponential(0.01);
  vector[18] theta_joint;
  theta_joint = append_row(theta_sm, theta_st);
  // 
  target += 0.1 * normal_lpdf(theta_0_init | 0, 2);
  target += 0.1 * normal_lpdf(theta_0_0 | 0, 2);
  target += 0.1 * multi_normal_lpdf(theta_joint | mu_theta, sigma_theta);
  // likelihood
  for (t in (n_serial + 1):N) {
    if (I_new[t] > 0) {
      real alpha = 1 + I_new[t];
      real beta_param = 1 + pop - I_new[t];
      target += 0.001 * beta_binomial_lpmf(cases[t] | pop, alpha, beta_param);
    }
    // print(cases[t]);
  }
  
}

# =============================================================== # 
##### MCMC

# log-likelihood function supplied to BayesianTools
# (generate_spline, generate_biv_spline, calc_prop_beta, sim_epidemic and
#  lik_weather are user-defined helpers not shown here)
ll = function(params, browse = FALSE){
  if(browse) browser()
  
  # spline built from the remaining parameters (params[19:length(params)])
  b0.spline = generate_spline(y.in = params[19:length(params)], bs.in = b0_basis)
  
  # bivariate splines built from the two 9-parameter blocks
  mosq.biv = generate_biv_spline(coef.prop = params[1:9], dat.range = mosq.range)
  temp.biv = generate_biv_spline(coef.prop = params[10:18], dat.range = temp.range)
  
  # transmission-rate time series: zero over the serial-interval window,
  # then the exponentiated spline-based rate for each remaining time point
  beta.ts = c(rep(0, length(weights.serialinterval)),
              exp(sapply((length(weights.serialinterval) + 1):nrow(GZ.data), function(tt)
                calc_prop_beta(tt, biv.temp = temp.biv, biv.mosq = mosq.biv,
                               browse = FALSE, b0.spline.tmp = b0.spline))))
  
  # simulate the epidemic and evaluate the log-likelihood of the simulations
  I.new.sim = sim_epidemic(n.reps = 100, beta.vec = beta.ts)
  LL = lik_weather(sim = I.new.sim)
  
  # guard against NA log-likelihoods with a very large negative value
  if(is.na(LL)){
    LL = -(.Machine$double.xmax)/10
  }
  # 0.001 scaling, matching the scaling of the Stan likelihood term
  return(0.001 * (LL))
}

# prior function
p = function(params){
  # multivariate normal prior on the 18 bivariate-spline coefficients
  p1 = dmvnorm(params[1:18], param.mean, makePositiveDefinite(sigma.mean), log = TRUE)
  # normal(0, 2) prior on the mean of the remaining 33 parameters
  p2 = dnorm(mean(params[19:51]), 0, 2, log = TRUE)
  # 0.001 scaling, matching the scaling of the log-likelihood
  return(0.001 * (sum(p1) + sum(p2)))
}
# sampler function: draws one candidate parameter vector
s = function(n1 = 1, n2 = 33){
  # browser()
  s1 = rmvnorm(n1, param.mean, makePositiveDefinite(sigma.mean))
  s2 = rnorm(n2, 0, 0.5)
  return(c(s1, s2))
}
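
For completeness, here is a minimal sketch of how these three functions might be passed to BayesianTools. The runMCMC() call with sampler = "SMC" and the settings names and values (initialParticles, iterations) are illustrative assumptions rather than the exact configuration used here:

library(BayesianTools)

# prior object built from the density p() and sampler s() defined above
prior <- createPrior(density = p, sampler = s)

# bundle the log-likelihood and prior into a setup object
setup <- createBayesianSetup(likelihood = ll, prior = prior)

# run the SMC sampler; particle count and iterations are placeholders
fit <- runMCMC(bayesianSetup = setup, sampler = "SMC",
               settings = list(initialParticles = 5000, iterations = 10))

summary(fit)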



A common reason for discrepancies between Stan and other software is that Stan gives more or less the correct posterior for the given problem while other software does not :-D Though obviously, Stan isn’t guaranteed to be correct.

You don’t provide enough information to let us make any specific judgements, and I’ve never heard of BayesianTools, but here are the things I would look at, roughly in order of increasing work and decreasing returns:

  1. Are all the Stan diagnostics good? (Did you get divergent transitions? Are your Rhat and ESS values good?)

  2. Apply the same convergence checks to the BayesianTools SMC output that you apply to the Stan fit (i.e. have multiple chains/runs, use posterior::summarise_draws, and check the Rhat and ESS values; see the sketch after this list). Many Bayesian tools by default employ no or only very weak convergence checks.

  3. Simulate a dataset with known parameter values - which posterior matches the known truth better?

  4. A more formal extension of the previous step is to run multiple such simulations and use simulation-based calibration checking (SBC; see the “Getting Started with SBC” vignette of the SBC R package) to determine whether a piece of software gives the correct posterior.
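
To make point 2 concrete, here is one way to run the same diagnostics on BayesianTools output with the posterior package. This is a sketch that assumes fit1, fit2, fit3 are independent runs of the same setup and that getSample() returns an iterations-by-parameters matrix for each:

library(BayesianTools)
library(posterior)

# one sample matrix per independent run (iterations x parameters)
mats <- lapply(list(fit1, fit2, fit3), getSample)

# stack into an iterations x chains x variables array, as posterior expects
arr <- array(unlist(mats),
             dim = c(nrow(mats[[1]]), ncol(mats[[1]]), length(mats)))
arr <- aperm(arr, c(1, 3, 2))
dimnames(arr) <- list(NULL, NULL, colnames(mats[[1]]))

# cross-run Rhat and ESS, the same diagnostics Stan reports
summarise_draws(as_draws_array(arr), "rhat", "ess_bulk", "ess_tail")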

Hope that helps at least a little!


Thank you very much for your thoughtful and helpful reply — I really appreciate you taking the time to share these suggestions.

  1. Regarding your first point, the Stan diagnostics appear satisfactory: there are no divergent transitions, Rhat ≤ 1.01 for all parameters, and although some effective sample sizes are below 400, they are generally acceptable given the model complexity and the number of iterations.

  2. For the second point, the BayesianTools SMC output shows poor convergence: Rhat > 1.01, with the largest value around 1.7, suggesting incomplete convergence.

3 & 4. As for the simulation-based checks you suggested, I agree that these would be the most principled way forward, and I plan to explore both single simulated datasets and a more systematic SBC analysis in subsequent work.
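
As a starting point, here is a toy sketch of the rank-check idea behind SBC, using a conjugate normal model in place of the real fit; all numbers are illustrative, and for the actual model the analytic posterior draws would be replaced by draws from Stan or BayesianTools:

set.seed(1)
n_sims  <- 200    # number of simulated datasets
n_draws <- 100    # posterior draws per fit
n_obs   <- 10     # observations per dataset
sigma   <- 1      # known observation sd

ranks <- replicate(n_sims, {
  theta_true <- rnorm(1, 0, 2)                   # draw parameter from its prior N(0, 2)
  y          <- rnorm(n_obs, theta_true, sigma)  # simulate data given that parameter
  post_var   <- 1 / (1 / 2^2 + n_obs / sigma^2)  # conjugate normal posterior
  post_mean  <- post_var * sum(y) / sigma^2
  draws      <- rnorm(n_draws, post_mean, sqrt(post_var))
  sum(draws < theta_true)                        # rank of the true value among draws
})

# if the fitting step targets the correct posterior, the ranks are uniform
hist(ranks, breaks = 20)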

Here are the Rhat values for the SMC and RStan runs. The SMC results are consistent with the published paper, with Rhat > 1.01 but close to 1.


An Rhat of 1.7 is very large, so I’d say this is very poor convergence and I wouldn’t trust the SMC results at all, i.e. the reason the results differ is that the SMC results are clearly incorrect, while Stan samples the posterior without much trouble and (assuming the model is in fact coded correctly etc., etc.) is thus likely to be correct.


Thank you for the clarification! I totally agree with your assessment.

If you have the compute budget, you probably want to use more than 25 warmup iterations: Stan can’t run its usual three stages of adaptation for the mass matrix and step size with only 25 iterations, and you would’ve gotten warning messages about that. Having said that, this problem looks very easy if you have an ESS of 200 with 100 post-warmup draws (I think your reporter truncates the max ESS at 200). Even the lp__ variable is mixing well, despite its R-hat not yet being low enough to really trust. I’d want to run until the R-hat on lp__ is also under 1.01, but I don’t think that’s going to change your results here. Also, it looks like a problem where a unit mass matrix will work fine.
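
To illustrate (the model object and data list names below are placeholders, and the settings are examples rather than recommendations), a longer warmup with an explicit unit metric in RStan might look like this:

# stan_model_obj and stan_data stand in for the compiled model and data list
fit <- rstan::sampling(
  stan_model_obj, data = stan_data,
  chains = 4, warmup = 1000, iter = 2000, seed = 1,
  control = list(metric = "unit_e")   # unit mass matrix
)
rstan::check_hmc_diagnostics(fit)     # divergences, treedepth, E-BFMI
print(fit, pars = "lp__")             # check that Rhat on lp__ is below 1.01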

The other thing to worry about here is that the ESS calculations can overestimate ESS when the sampler is trapped in a local region. For example, a well separated multimodal posterior will have fine ESS if all the chains are initialized in the same mode. Or if you have something like the funnel, and don’t explore the neck, the ESS and R-hat can look fine because it looks like you are exploring the places you explore.

The only real way to know if you’re getting the right answer is to simulate.