Thank you very much for the reply!!!
Any help is more than welcome, so please don’t hesitate to ask for additional information or clarification if needed, and feel free to double- or triple-check my reasoning :)
I’m a pharmacist by training, so HPC and computer-science concepts are still something I’m actively learning.
1) Cluster constraints and job configuration
Per task, I have a hard limit of 90 cores.
One possible approach would be to run four separate jobs, each using 90 cores, with one chain per job. In that case, each chain could leverage reduce_sum, and I would merge the chains afterward. While this is technically feasible, I am somewhat concerned about managing many output files across the large number of model variants I plan to run. More importantly, I rely heavily on the generated quantities block for simulations and predictions, and I feel uneasy about running chains independently and recombining outputs afterward, although I’m not sure whether this concern is actually justified.
Another approach would be to run a single job with multiple tasks, for example four tasks (one per chain), each task using around 90 cores. This is the configuration that initially led me to consider MPI together with map_rect, as it feels conceptually cleaner to me: one job, one model, one set of outputs.
In practice, although the hard limit is 90 cores, I currently request only around 40 cores, mainly because queue times for 90 cores are very long, and because I am not yet convinced that increasing from 40 to 90 cores would lead to a meaningful reduction in wall time.
2) Expected scaling with additional cores
At this stage, I do not have strong empirical evidence that wall time will continue to decrease substantially as the number of cores increases, but given the size of the model, I suspect that additional parallelism might still be beneficial.
In a previous project (see the attached poster, A_MERLAUD_Characterizing_the_organ-specific_association_between_tumor_dynamics_and_overall_survival_across_cancer_types_and_studies.pdf), I analyzed trials separately. In one such trial, I had approximately 1,400 tumors measured longitudinally, nested within about 600 patients, and I fitted a multilevel joint model combining tumor dynamics and survival, with random effects at the lesion and patient levels. Using reduce_sum led to a substantial speed-up, but I still reached a wall time of around 8 hours.
I now plan to analyze all studies simultaneously, with roughly 30,000 tumors measured over time across 22 studies, three levels of random effects (tumor, individual, and study), and several fixed effects to assess covariates. Given the much larger number of observations, I initially assumed that it might be possible to make profitable use of more cores than before. In addition, in another thread (see Code optimization & reduce_sum), you suggested slicing at the tumor level rather than by study, which also seems to point in the direction of increased parallelism.
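For what it's worth, here is how I currently picture tumor-level slicing. This is only a toy sketch with a placeholder normal likelihood and made-up names (`obs_start`, `n_obs`, `mu_tumor` are illustrations, not my actual model): observations are sorted by tumor, and the sliced argument is simply the array of tumor indices, so the scheduler can hand contiguous blocks of tumors to each thread.

```stan
functions {
  // Log-likelihood contribution of one slice of tumors.
  // All data structures here are placeholders for illustration.
  real partial_sum(array[] int tumor_slice, int start, int end,
                   vector y, array[] int obs_start, array[] int n_obs,
                   vector mu_tumor, real sigma) {
    real lp = 0;
    for (i in 1:size(tumor_slice)) {
      int t = tumor_slice[i];
      // each tumor's measurements form a contiguous block in y
      lp += normal_lpdf(segment(y, obs_start[t], n_obs[t]) | mu_tumor[t], sigma);
    }
    return lp;
  }
}
data {
  int<lower=1> N;              // total longitudinal measurements
  int<lower=1> T;              // number of tumors
  vector[N] y;                 // measurements, sorted by tumor
  array[T] int obs_start;      // first row of each tumor's block in y
  array[T] int n_obs;          // number of measurements per tumor
  int<lower=1> grainsize;
}
parameters {
  vector[T] mu_tumor;          // stand-in for the tumor-level effects
  real<lower=0> sigma;
}
model {
  mu_tumor ~ normal(0, 1);
  sigma ~ normal(0, 1);
  target += reduce_sum(partial_sum, linspaced_int_array(T, 1, T),
                       grainsize, y, obs_start, n_obs, mu_tumor, sigma);
}
```

With ~30,000 tumors as slice units, this would give the scheduler far more work packets than slicing by the 22 studies, which I understand is the point of the suggestion.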
At this point, perhaps somewhat naively, I feel that I should either fully commit to a reduce_sum-based implementation or switch to MPI with map_rect.
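To make the trade-off concrete for myself, I sketched what the same toy likelihood would look like under map_rect (again with placeholder names, one tumor per shard for simplicity). My understanding is that every shard's data must be packed into rectangular, padded real/int arrays, and the per-shard parameters into an array of vectors, which is where most of the extra complexity seems to come from:

```stan
functions {
  // One shard = one tumor. x_r holds that tumor's measurements
  // (padded to a common length M), x_i[1] the true count,
  // theta the tumor-level mean, phi the shared parameters.
  vector tumor_ll(vector phi, vector theta,
                  data array[] real x_r, data array[] int x_i) {
    int n = x_i[1];
    return [normal_lpdf(to_vector(x_r[1:n]) | theta[1], phi[1])]';
  }
}
data {
  int<lower=1> T;              // number of tumors / shards
  int<lower=1> M;              // max measurements per tumor
  array[T, M] real x_r;        // padded measurements
  array[T, 1] int x_i;         // true measurement count per tumor
}
parameters {
  vector[T] mu_tumor;
  real<lower=0> sigma;
}
transformed parameters {
  vector[1] phi = [sigma]';            // shared parameters, packed
  array[T] vector[1] theta;            // per-shard parameters, packed
  for (t in 1:T) theta[t] = [mu_tumor[t]]';
}
model {
  mu_tumor ~ normal(0, 1);
  sigma ~ normal(0, 1);
  target += sum(map_rect(tumor_ll, phi, theta, x_r, x_i));
}
```

If I understand correctly, this version could then be compiled with STAN_MPI=true and launched across nodes with mpirun, whereas the reduce_sum version only needs threading within one node, with no packing or padding at all.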
3) Current runtimes
With a simplified version of the model, including only the tumor-dynamics component (no survival submodel, random effects only, no fixed effects), runtime was approximately three days for 1,000 iterations (750 of them warm-up), using 2 chains, 80 cores, and a grainsize of 20.
Of course, I do plan to reduce the number of iterations, improve the priors and initial values, and continue optimizing the model structure.
Given all this, would you still recommend avoiding MPI and map_rect in my case, or do you think there could indeed be a benefit despite the additional complexity?
Thank you again for your time and advice!!!
-S