Running cmdstanr in parallel on a computing cluster

Hi Stan folks,

I’m trying to run Stan on a computing cluster, but I can’t seem to get the parallelization right.

On my local machine I regularly run jobs using parallel chains and/or within-chain parallelization via the reduce_sum() function. In the call to cmdstan_model() I specify cpp_options = list(STAN_THREADS = TRUE), and in the call to sample() I specify parallel_chains and threads_per_chain. Everything works: for example, if I run two parallel chains with four threads per chain, all eight cores on my laptop get used.
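For concreteness, here's a minimal sketch of that local setup (the model file name and data object are placeholders for my actual ones):

library(cmdstanr)

# compile with threading support so reduce_sum() can parallelize within a chain
mod <- cmdstan_model("model.stan", cpp_options = list(STAN_THREADS = TRUE))

# placeholder data; in practice this is my real data list
data_list <- readRDS("data_list.rds")

# two parallel chains x four threads per chain = all eight laptop cores
fit <- mod$sample(data = data_list, chains = 2, parallel_chains = 2, threads_per_chain = 4)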

Over on the computing cluster this doesn’t seem to work. I think the cluster is a fairly standard one using the PBS Pro job scheduler (https://wiki.chpc.ac.za/). It has lots of compute nodes, with 24 CPUs available on each. I’ve tried submitting a job to a single node, using all 24 cores to run four chains with six threads per chain. The model compiles, creates an executable and samples, but when I ssh into the compute node during sampling and look at the activity there (with htop), only one CPU is active.

What is “the right way” to run Stan in parallel in this situation? The various Stan/cmdstanr pages here (Run Stan's MCMC algorithms with MPI — model-method-sample_mpi • cmdstanr), here (15 Parallelization | CmdStan User’s Guide) and here (Stan Math Library: MPI) seem to suggest that maybe I should be using MPI and the sample_mpi() method, along with map_rect() (rather than reduce_sum()) for the within-chain parallelization? My (admittedly near-nonexistent) understanding of MPI is that it’s meant for jobs spread across multiple compute nodes, rather than for jobs within a single compute node?

Any help would be much appreciated!

If things work for you locally, but not on the cluster, then maybe you are not requesting the resources correctly through PBS?

Can you sample multiple chains (say 10) in parallel? If that works, then threading should also work.
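Something along these lines (a sketch, with placeholder file and data names) would show whether plain multi-chain parallelism already uses multiple cores on the node:

library(cmdstanr)

# compile WITHOUT threading and run many chains in parallel;
# if htop on the compute node shows ~10 busy cores, the PBS request is fine
# and the problem is specific to threading
mod_plain <- cmdstan_model("model.stan")
fit_test <- mod_plain$sample(data = data_list, chains = 10, parallel_chains = 10)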

I wanted to look into this but haven’t gotten around to it. You are not the only one with this problem, and I haven’t seen a solution yet. Could you try the proposed steps from here and tell us if that works?

Thanks for your comment @wds15. It’s possible that I am not requesting the resources correctly. Here’s my job script; does anything look obviously wrong with the resource requests?

The line ‘#PBS -l select=1:ncpus=24:mpiprocs=1’ asks for one compute node, all 24 cores on that node, and 1 MPI process (I’m not sure what that last part really does, if anything).

#!/bin/bash
#PBS -l select=1:ncpus=24:mpiprocs=1
#PBS -P PROJECT_SHORTNAME
#PBS -q smp
#PBS -l walltime=36:00:00
#PBS -o /mnt/lustre/users/USER/stdout.txt
#PBS -e /mnt/lustre/users/USER/stderr.txt
#PBS -N RJob
#PBS -M USER_EMAIL
#PBS -m abe
 
# Add R module with cmdstanr
module add chpc/BIOMODULES cmdstan R
 
# explicitly calculate number of processes.
nproc=`cat $PBS_NODEFILE | wc -l`
 
# make sure we're in the correct working directory.
cd /mnt/lustre/users/USER
 
mpirun -np $nproc -machinefile $PBS_NODEFILE  R --slave -f fit_model.R

The R script fit_model.R uses cmdstanr and compiles the model (which is coded with reduce_sum()) with the call

mod <- cmdstan_model(file, cpp_options = list(stan_threads = TRUE))

and samples with

fit <- mod$sample(data = data_list, chains = 4, parallel_chains = 4, threads_per_chain = 6)
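Put together, fit_model.R looks roughly like this (the model file, data file and output name here are placeholders for my actual ones):

library(cmdstanr)

# compile with threading support for reduce_sum()
mod <- cmdstan_model("model.stan", cpp_options = list(stan_threads = TRUE))

# placeholder for the actual data preparation
data_list <- readRDS("data_list.rds")

# 4 chains x 6 threads per chain = 24 cores, matching the ncpus=24 request
fit <- mod$sample(data = data_list, chains = 4, parallel_chains = 4, threads_per_chain = 6)

# save the fitted object for later use
fit$save_object(file = "fit.rds")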

If I request one MPI process per CPU with ‘#PBS -l select=1:ncpus=24:mpiprocs=24’, it appears to run 24 separate copies of the R script (cmdstanr’s starting message is printed 24 times to stderr.txt) and fails with an error.

Thanks @scholz. I’m afraid I don’t see how the link relates to my issue? The model does sample and the posterior looks correct (i.e. the same as on my local machine), but it doesn’t appear to use all the CPU resources at its disposal.

When we tried using cmdstanr for our simulation study on a cluster, it only used one core as well. Our cluster uses an R build that is optimized for (and dependent on) the Intel compiler, and thus probably uses the MKL. We figured that the failure to use more than one core could therefore be related to the threading environment variables described in the tweet.
That sounded close enough to your problem (running fine locally but not on the cluster) that I figured they probably share the same root cause.
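For what it’s worth, a quick way to check from inside the job whether that’s what’s happening (these are the usual threading variables, not anything cmdstanr-specific):

# run inside the R session on the compute node
Sys.getenv(c("OMP_NUM_THREADS", "MKL_NUM_THREADS", "STAN_NUM_THREADS"))
parallel::detectCores()  # how many cores R actually sees on the node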

The issue seems to have been resolved with the help of the folks at the CHPC.

It appears that the mpirun command in the job script was somehow interfering. Removing it and just running the R script directly works perfectly in this case, using all 24 CPUs on a single compute node. Here’s what the job script looks like now:

#!/bin/bash
#PBS -l select=1:ncpus=24
#PBS -P PROJECT_SHORTNAME
#PBS -q smp
#PBS -l walltime=36:00:00
#PBS -o /mnt/lustre/users/USER/stdout.txt
#PBS -e /mnt/lustre/users/USER/stderr.txt
#PBS -N RJob
#PBS -M USER_EMAIL
#PBS -m abe
 
# Add R module with cmdstanr
module add chpc/BIOMODULES cmdstan R
 
# make sure we're in the correct working directory.
cd /mnt/lustre/users/USER
 
R --slave -f fit_model.R