RStan parallelising using reduce_sum()

Hi there, I am a total newbie to Stan and I was wondering if anybody had any ideas on how to parallelise this code by David Phillipo to run over multiple cores? The issue I have is that the models run extremely slowly and often we have to batch process 50+ of these which soon becomes unmanageable. I wish I could share an attempt but I’m not even really sure where to begin other than some sense that I should be using reduce_sum(). There is a link to the code below and I’ve pasted the section of the code that I think can be parallelised underneath. Any help gratefully received, even if it’s just a hint to get started.

Link to code: multinma/binomial_2par.stan at master · dmphillippo/multinma · GitHub

Code section that I think can be parallelised:

 // -- IPD likelihood --
  if (link == 1) { // logit link
    // Could replace with bernoulli_logit_glm in Stan > 2.20
    ipd_r ~ bernoulli_logit(eta_ipd);
  } else {
    ipd_r ~ bernoulli(theta_ipd);

  // -- AgD likelihood (arm-based) --
  // We have to hand code the log likelihood contribution for the adjusted
  // binomial here, as N is not necessarily an integer (which Stan doesn't
  // like). The following is exactly equivalent to:
  // ag_r ~ binomial(nprime, pprime);
  for (i in 1:ni_agd_arm)
    target += lchoose(nprime[i], agd_arm_r[i]) +
              lmultiply(agd_arm_r[i], pprime[i]) +
              (nprime[i] - agd_arm_r[i]) * log1m(pprime[i]);
generated quantities {
  vector[ni_agd_arm * n_int_thin] theta2_bar_cum;
#include /include/generated_quantities_theta_fitted.stan
#include /include/generated_quantities_common.stan
#include /include/generated_quantities_theta.stan

  // IPD log likelihood and residual deviance
  for (i in 1:ni_ipd) {
    log_lik[i] = bernoulli_lpmf(ipd_r[i] | theta_ipd[i]);
    resdev[i] = -2 * log_lik[i];
    fitted_ipd[i] = theta_ipd[i];

Couple of quick things

  1. If batch processing 50+ of these becomes unmanageable, then it sounds like you’re already limited by the number of cores you have available. Parallelizing to use more cores won’t help; in fact it’ll slow things down.
  2. You can check whether the loop in the model block or the loop in the generated quantities block is the bottleneck using profiling.
  3. The for-loop over i in the model block looks like a candidate for parallelization. The efficient way to do this will depend on what of the arguments are parameters and what are data.
  4. If the generated quantities calculation is a bottleneck and you have enough resources available to parallelize, you might consider fitting the model without the generated quantities, and then parallelizing the computation of the generated quantities outside of Stan using the model output.