Running cmdstanr in parallel on a computing cluster

Hi Stan folks,

I’m trying to run Stan on a computing cluster, but I can’t seem to get the parallelization right.

On my local machine I regularly run jobs using parallel chains and/or within-chain parallelization via the reduce_sum() function. In the call to cmdstan_model() I specify cpp_options = list(STAN_THREADS = TRUE), and in the call to sample() I specify parallel_chains and threads_per_chain. Everything works: for example, if I run two parallel chains with four threads per chain, all eight cores on my laptop get used.
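For concreteness, here's a minimal sketch of that local setup (the model file name and data object are placeholders for my actual ones):

library(cmdstanr)

# compile with threading support so reduce_sum() can parallelize within a chain
mod <- cmdstan_model("model.stan", cpp_options = list(STAN_THREADS = TRUE))

# placeholder data; in practice this is my real data list
data_list <- readRDS("data_list.rds")

# two parallel chains x four threads per chain = all eight laptop cores
fit <- mod$sample(data = data_list, chains = 2, parallel_chains = 2, threads_per_chain = 4)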

Over on the computing cluster this doesn’t seem to work. I think the cluster is a fairly standard one using the PBS Pro job scheduler (https://wiki.chpc.ac.za/). It has lots of compute nodes, with 24 CPUs available on each. I’ve tried submitting a job to a single node, using all 24 cores to run four chains with six threads per chain. The model compiles, creates an executable and samples, but when I ssh into the compute node during sampling and look at the activity there (with htop), only one CPU is active.

What is “the right way” to run Stan in parallel in this situation? The various Stan/cmdstanr pages here (Run Stan's MCMC algorithms with MPI — model-method-sample_mpi • cmdstanr), here (15 Parallelization | CmdStan User’s Guide) and here (Stan Math Library: MPI) seem to suggest that maybe I should be using MPI and the sample_mpi() method, along with map_rect() (rather than reduce_sum()) for the within-chain parallelization? My (admittedly near-nonexistent) understanding of MPI is that it’s meant for jobs spread across multiple compute nodes, rather than for jobs within a single compute node?

Any help would be much appreciated!

If things work for you locally, but not on the cluster, then maybe you are not requesting the resources correctly through PBS?

Can you sample multiple chains (say 10) in parallel? If that works, then threading should also work.
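Something along these lines (a sketch, with placeholder file and data names) would show whether plain multi-chain parallelism already uses multiple cores on the node:

library(cmdstanr)

# compile WITHOUT threading and run many chains in parallel;
# if htop on the compute node shows ~10 busy cores, the PBS request is fine
# and the problem is specific to threading
mod_plain <- cmdstan_model("model.stan")
fit_test <- mod_plain$sample(data = data_list, chains = 10, parallel_chains = 10)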

I wanted to look into this but haven’t gotten around to it. You are not the only one with this problem, and I haven’t seen a solution yet. Could you try the proposed steps from here and tell us if that works?

Thanks for your comment @wds15. It’s possible that I am not requesting the resources correctly. Here’s my job script; does anything look obviously wrong with the resource requests?

The line ‘#PBS -l select=1:ncpus=24:mpiprocs=1’ asks for one compute node, all 24 cores on that node, and 1 MPI process (I’m not sure what that last part really does, if anything).

#!/bin/bash
#PBS -l select=1:ncpus=24:mpiprocs=1
#PBS -P PROJECT_SHORTNAME
#PBS -q smp
#PBS -l walltime=36:00:00
#PBS -o /mnt/lustre/users/USER/stdout.txt
#PBS -e /mnt/lustre/users/USER/stderr.txt
#PBS -N RJob
#PBS -M USER_EMAIL
#PBS -m abe
 
# Add R module with cmdstanr
module add chpc/BIOMODULES cmdstan R
 
# explicitly calculate number of processes.
nproc=`cat $PBS_NODEFILE | wc -l`
 
# make sure we're in the correct working directory.
cd /mnt/lustre/users/USER
 
mpirun -np $nproc -machinefile $PBS_NODEFILE  R --slave -f fit_model.R

The R script fit_model.R uses cmdstanr and compiles the model (which is coded with reduce_sum()) with the call

mod <- cmdstan_model(file, cpp_options = list(stan_threads = TRUE))

and samples with

fit <- mod$sample(data = data_list, chains = 4, parallel_chains = 4, threads_per_chain = 6)
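Put together, fit_model.R looks roughly like this (the model file, data file and output name here are placeholders for my actual ones):

library(cmdstanr)

# compile with threading support for reduce_sum()
mod <- cmdstan_model("model.stan", cpp_options = list(stan_threads = TRUE))

# placeholder for the actual data preparation
data_list <- readRDS("data_list.rds")

# 4 chains x 6 threads per chain = 24 cores, matching the ncpus=24 request
fit <- mod$sample(data = data_list, chains = 4, parallel_chains = 4, threads_per_chain = 6)

# save the fitted object for later use
fit$save_object(file = "fit.rds")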

If I request one MPI process per CPU with ‘#PBS -l select=1:ncpus=24:mpiprocs=24’, it appears to run 24 separate copies of the R script (cmdstanr’s starting message is printed 24 times to stderr.txt) and fails with an error.

Thanks @scholz. I’m afraid I don’t see how the link relates to my issue? The model does sample and the posterior looks correct (i.e. the same as on my local machine), but it doesn’t appear to use all the CPU resources at its disposal.

When we tried using cmdstanr for our simulation study on a cluster, it only used one core as well. Our cluster uses an R build that is optimized for (and dependent on) the Intel compiler, and thus probably uses the MKL. We figured that the failure to use more than one core could therefore be related to the threading environment variables described in the tweet.
That sounded close enough to your problem (running fine locally but not on the cluster) that I figured they probably share the same root cause.
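For what it’s worth, a quick way to check from inside the job whether that’s what’s happening (these are the usual threading variables, not anything cmdstanr-specific):

# run inside the R session on the compute node
Sys.getenv(c("OMP_NUM_THREADS", "MKL_NUM_THREADS", "STAN_NUM_THREADS"))
parallel::detectCores()  # how many cores R actually sees on the node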

The issue seems to have been resolved with the help of the folks at the CHPC.

It appears that the mpirun command in the job script was somehow interfering. Removing it and just running the R script directly works perfectly in this case, using all 24 CPUs on a single compute node. Here’s what the job script looks like now:

#!/bin/bash
#PBS -l select=1:ncpus=24
#PBS -P PROJECT_SHORTNAME
#PBS -q smp
#PBS -l walltime=36:00:00
#PBS -o /mnt/lustre/users/USER/stdout.txt
#PBS -e /mnt/lustre/users/USER/stderr.txt
#PBS -N RJob
#PBS -M USER_EMAIL
#PBS -m abe
 
# Add R module with cmdstanr
module add chpc/BIOMODULES cmdstan R
 
# make sure we're in the correct working directory.
cd /mnt/lustre/users/USER
 
R --slave -f fit_model.R