Code optimization & reduce_sum

Hi Stan users!

I’m modeling hierarchical tumor size measurements with the nesting structure: measurements over time, within tumors, within organs, within patients, within studies. I have about 150,000 measurements from 30,000 tumors across 22 studies. I’m trying to reduce runtime using within-chain parallelization via reduce_sum on an HPC cluster (up to ~80 cores per node).
Right now my partial_sum function is sliced by study, and inside it I loop over studies, patients, and tumors. For each tumor I extract its measurement segment and compute the longitudinal mean. Something like:

for (ss in 1:nstudy_sub) {
  int s = study[ss];
  for (j in 1:nb_patient_per_study[s]) {
    for (l in 1:nb_lesions_per_patient_per_study[j, s]) {
      int len = end_lesion[s, j, l] - start_lesion[s, j, l] + 1;

      // Note: chained slices like Y[a:b][c:d] index into the result of
      // the previous slice, so start_patient_etude and start_lesion must
      // be relative to the enclosing segment, not global positions in Y.
      vector[len] Yobs = Y[start_etude[s]:end_etude[s]]
                           [start_patient_etude[s, j]:end_patient_etude[s, j]]
                           [start_lesion[s, j, l]:end_lesion[s, j, l]];

      vector[len] Xobs = X[start_etude[s]:end_etude[s]]
                           [start_patient_etude[s, j]:end_patient_etude[s, j]]
                           [start_lesion[s, j, l]:end_lesion[s, j, l]];

      // ... compute Ypred and likelihood ...
    }
  }
}

Studies, patients, and tumors are highly imbalanced: some studies have many more patients/measurements than others, and within a study some patients have many more tumors and/or measurements than others.

My questions:

  1. Since I have study-, patient-, and tumor-level parameters (random effects at each level), is there a recommended way to structure the computation so that data slicing is efficient?
  2. Is slicing by study still a good strategy for reduce_sum, or should I rewrite to slice at a finer level (tumors or measurements) to improve load balancing across many cores?
  3. Any best practices for efficiently indexing/extracting tumor-level measurements in Stan would be very welcome.

Thanks a lot for any guidance!

-S

There are two possibilities, each of which could turn out to be the most efficient. One is to slice the tumor-level parameter vector, to avoid passing 30,000 parameters back and forth between cores. The other is to slice the observations, to keep the computational load balanced. Which is better may depend both on your hardware and on the details of the computation at the different levels of the model.
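To make the first option concrete, here is a rough sketch of slicing the tumor-level parameters themselves (all names are placeholders, not from your model, and the likelihood is a stand-in for your nonlinear dynamics). Since reduce_sum slices its second argument, storing the per-tumor parameters as an array means each shard only receives its own segment:

```stan
functions {
  // Sketch: theta_slice is the shard's segment of the tumor-level
  // parameters; everything after start/end is shared across threads.
  real partial_sum(array[] real theta_slice, int start, int end,
                   vector Y,
                   array[] int obs_start,   // first observation of each tumor
                   array[] int n_obs,       // number of observations per tumor
                   real sigma) {
    real lp = 0;
    for (i in 1:size(theta_slice)) {
      int t = start + i - 1;  // global tumor index
      vector[n_obs[t]] Yobs = segment(Y, obs_start[t], n_obs[t]);
      lp += normal_lpdf(Yobs | theta_slice[i], sigma);  // stand-in likelihood
    }
    return lp;
  }
}
model {
  // theta_tumor is array[N_tumor] real; grainsize controls shard size
  target += reduce_sum(partial_sum, theta_tumor, grainsize,
                       Y, obs_start, n_obs, sigma);
}
```

With highly variable observations per tumor, a smaller grainsize gives the scheduler more freedom to balance shards, at the cost of some overhead.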


Thanks for the feedback!

For context, I’m running rstan on a Linux HPC cluster, with up to ~80 cores per node and ~200 GB RAM (though large allocations can take a long time to schedule). I’m therefore trying to make reduce_sum efficient even with moderate core counts.

Just to confirm my understanding: switching from study-level slicing to tumor-level or observation-level slicing is not only a matter of changing the object passed to reduce_sum, but also requires adapting the partial_sum signature and internal loops so that each shard iterates over tumors (or observations) and maps them back to the corresponding patient/study parameters — is that correct?
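Yes, that is the shape of the change. As a sketch (hypothetical names throughout), a tumor-sliced partial_sum can slice over a dummy tumor-index array, with lookup arrays mapping each tumor back to its patient- and study-level parameters:

```stan
functions {
  // Sketch: start/end now index the tumor list rather than studies.
  // patient_of_tumor / study_of_tumor map each tumor to its shared
  // higher-level parameters. All names are placeholders.
  real partial_sum(array[] int tumor_seq, int start, int end,
                   vector Y, vector X,
                   array[] int obs_start, array[] int n_obs,
                   array[] int patient_of_tumor,
                   array[] int study_of_tumor,
                   vector theta_tumor,
                   vector theta_patient,
                   vector theta_study,
                   real sigma) {
    real lp = 0;
    for (t in start:end) {
      vector[n_obs[t]] Yobs = segment(Y, obs_start[t], n_obs[t]);
      vector[n_obs[t]] Xobs = segment(X, obs_start[t], n_obs[t]);
      // ... build Ypred from theta_tumor[t],
      //     theta_patient[patient_of_tumor[t]],
      //     theta_study[study_of_tumor[t]],
      //     then accumulate the (censored/truncated) likelihood ...
      // lp += normal_lpdf(Yobs | Ypred, sigma);
    }
    return lp;
  }
}
```

Note that with this pattern the full theta_tumor vector is still passed to every shard; slicing the parameters themselves (as in the earlier reply) avoids that copy but requires storing them as an array.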

In my model, ~30,000 tumor-level parameters are used to compute nonlinear, piecewise tumor dynamics over time (per tumor), which are then passed to a censored/truncated measurement likelihood. This is a nonlinear mixed-effects model, so all tumor dynamics are linked through shared study- and patient-level parameters. The number of observations per tumor varies substantially (from 1 up to ~40). I guess that tumor level slicing is preferable here, but any guidance will be greatly appreciated!

Currently, I compute all tumor-level parameters in the transformed parameters block to avoid recomputation. However, from a memory-traffic and threading perspective, I’m wondering whether it might be preferable to compute tumor-level parameters inside partial_sum on the fly, so that each shard only computes the parameters it actually needs.
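One way to explore that on-the-fly variant (again a sketch with hypothetical names): slice the raw non-centered tumor draws and reconstruct each tumor's parameter inside partial_sum, so each shard only transforms the parameters it actually uses:

```stan
functions {
  // Sketch: z_slice holds the shard's raw (non-centered) tumor draws;
  // the tumor-level parameter is reconstructed inside the shard instead
  // of being precomputed for all 30,000 tumors in transformed
  // parameters. mu_patient and tau are shared. Names are placeholders.
  real partial_sum(array[] real z_slice, int start, int end,
                   vector Y,
                   array[] int obs_start, array[] int n_obs,
                   array[] int patient_of_tumor,
                   vector mu_patient, real tau, real sigma) {
    real lp = 0;
    for (i in 1:size(z_slice)) {
      int t = start + i - 1;
      // non-centered reconstruction, done only for this shard's tumors
      real theta_t = mu_patient[patient_of_tumor[t]] + tau * z_slice[i];
      vector[n_obs[t]] Yobs = segment(Y, obs_start[t], n_obs[t]);
      lp += normal_lpdf(Yobs | theta_t, sigma);  // stand-in likelihood
    }
    return lp;
  }
}
```

The prior on the raw draws (e.g. z ~ std_normal()) can stay in the model block; only the transformation moves inside the shard. The trade-off is that transformed-parameter values of theta are no longer saved in the output.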

Thanks a lot!

– S