I’m trying to run Stan on a computing cluster, but I can’t seem to get the parallelization right.
On my local machine I regularly run jobs using parallel chains and/or within-chain parallelization via the reduce_sum() function. In the call to cmdstan_model() I specify cpp_options = list(STAN_THREADS = TRUE), and in the call to sample() I specify parallel_chains and threads_per_chain. Everything works: for example, if I run two parallel chains with four threads per chain, all eight cores on my laptop get used.
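For reference, the relevant calls on my laptop look something like this (the model and data names are just placeholders):

library(cmdstanr)

# compile with within-chain threading support
mod <- cmdstan_model("model.stan", cpp_options = list(STAN_THREADS = TRUE))

# two chains in parallel, four threads each -> all eight local cores in use
fit <- mod$sample(
  data = data_list,
  chains = 2,
  parallel_chains = 2,
  threads_per_chain = 4
)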
Over on the computing cluster this doesn't seem to work. I think the cluster is a fairly standard one using the PBS Pro job scheduler etc. (https://wiki.chpc.ac.za/). It has lots of compute nodes, with 24 CPUs available on each. I've tried to submit a job to a single node, using all 24 cores to run four chains with six threads per chain. The model compiles, creates an executable and samples, but when I ssh into the compute node during sampling and look at the activity there (with htop), only one CPU is active.
I wanted to look into this but haven't gotten around to it. You are not the only one with this problem, and I haven't seen a solution yet. Could you try the proposed steps from here and tell us whether that works?
If I request one MPI process per CPU with ‘#PBS -l select=1:ncpus=24:mpiprocs=24’, it appears to run 24 separate copies of the R script (cmdstanr's startup message is printed 24 times to the stderr.txt file) and fails with an error.
Thanks @scholz. I'm afraid I don't see how the link relates to my issue? The model does sample and the posterior looks correct (i.e. the same as on my local machine), but it doesn't appear to use all the CPU resources at its disposal.
When we tried using cmdstanr for our simulation study on a cluster, it only used one core as well. Our cluster uses an R version that is optimized for/dependent on the Intel compiler and thus probably uses the MKL. We figured that using no more than one core could therefore be related to the threading environment variables described in the tweet.
This sounded like it was close enough to your problem (running fine locally but not on the cluster) that I figured they probably have the same root cause.
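If it helps to narrow this down, a couple of quick checks from an R session on the compute node might be useful. The variable names below are just the usual MKL/OpenMP threading suspects, not necessarily the exact ones the tweet refers to:

# how many cores R can see on the node
parallel::detectCores()

# threading-related environment variables an MKL-linked R build may respect
Sys.getenv(c("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"))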
The issue seems to have been resolved with the help of the folks at the CHPC.
It appears that the mpirun command in the job script was interfering somehow? Removing it and just running the R script works perfectly for this case, using all 24 CPUs on a single compute node. Here's what the job script looks like now:
#PBS -l select=1:ncpus=24
#PBS -P PROJECT_SHORTNAME
#PBS -q smp
#PBS -l walltime=36:00:00
#PBS -o /mnt/lustre/users/USER/stdout.txt
#PBS -e /mnt/lustre/users/USER/stderr.txt
#PBS -N RJob
#PBS -M USER_EMAIL
#PBS -m abe
# Add R module with cmdstanr
module add chpc/BIOMODULES cmdstan R
# make sure we're in the correct working directory.
R --slave -f fit_model.R
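The R script itself needs nothing cluster-specific; something along these lines (model and data names are placeholders) is enough to keep all 24 cores busy, since cmdstanr handles the chain- and thread-level parallelism on its own:

library(cmdstanr)

mod <- cmdstan_model("model.stan", cpp_options = list(STAN_THREADS = TRUE))

# four chains in parallel, six threads each -> 24 cores on the node
fit <- mod$sample(
  data = data_list,
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = 6
)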