Thanks for sharing, in particular the difference with tcsh.
Under the Slurm job scheduler, an array job is done similarly, though many large Slurm clusters now allocate a whole node as the unit of resources, so one runs multiple chains within each job. FWIW, my script for a hyperparameter sweep looks like this:
#!/bin/bash
#SBATCH -C gpu -t 12:00:00
#SBATCH -J vep-sd-lsp
#SBATCH --array=1-21

# Default to task 1 so the script also runs outside Slurm.
id=${SLURM_ARRAY_TASK_ID:-"1"}

# Build a per-task data file: copy the base data and append
# this task's hyperparameter value from log_sd_lsp.txt.
rm -fv data/${id}.R
cp data.R data/${id}.R
echo 'log_sd_lsp <- ' $(grep "^\s*${id}\s" log_sd_lsp.txt | cut -f2) >> data/${id}.R

# Run 12 chains in parallel, one background process per chain.
rm -fv sample/${id}.*.csv
for i in {1..12}
do
    ./model sample num_warmup=1000 num_samples=1000 \
        data file=data/${id}.R output file=sample/${id}.${i}.csv refresh=50 \
        &> logs/stan-${id}.${i}.txt &
done
echo `date` 'waiting for chains to finish'
wait

# Summarize and diagnose the finished chains.
rm -fv summary/${id}.* diagnose/${id}.txt
# ./stan* are symlinks to binaries in the cmdstan-*/bin folder
./stansummary --csv_file=summary/${id}.csv sample/${id}.*.csv &> summary/${id}.txt &
./standiagnose sample/${id}.*.csv &> diagnose/${id}.txt &
echo `date` 'waiting for stansummary & diagnose to finish'
wait

# Clean up: drop the raw samples, archive and remove the chain logs.
echo `date` 'cleaning up'
rm -fv sample/${id}.*
tar cvjf logs/stan-${id}.tbz logs/stan-${id}.*.txt
rm -fv logs/stan-${id}.*.txt
This is then submitted with sbatch run.sh.
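The per-task pieces can be exercised outside Slurm: the `:-"1"` default makes the task id fall back to 1, and the grep/cut lookup pulls that task's value from the sweep table. A minimal sketch, assuming a hypothetical three-row log_sd_lsp.txt (the real file's contents are not shown above) and GNU grep for the `\s` escapes:

```shell
# Without Slurm, SLURM_ARRAY_TASK_ID is unset and the default kicks in.
id=${SLURM_ARRAY_TASK_ID:-"1"}

# Hypothetical sweep table: task id, then a tab, then the parameter value.
printf '1\t0.10\n2\t0.25\n3\t0.50\n' > log_sd_lsp.txt

# Same lookup as in the script: match the line whose first field is the
# task id, then keep the second (tab-separated) field.
val=$(grep "^\s*${id}\s" log_sd_lsp.txt | cut -f2)
echo "log_sd_lsp <- ${val}"
```

Running this as a plain shell script prints the line that the job script appends to data/1.R.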
I’m tempted to say that this may be an opportunity for additional tooling around CmdStan, but workflows become so opinionated that common abstractions are difficult to identify.
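Still, the core pattern any such tooling would abstract is small: fan out N background processes, then `wait` for all of them. A stand-in sketch, assuming plain bash (the `/tmp/chain.*` files here are placeholders for the CmdStan runs, not part of the script above):

```shell
# Fan out four background "chains", each writing to its own file,
# then block until every one of them has exited.
rm -f /tmp/chain.*.txt
for i in {1..4}
do
    ( echo "chain $i done" > /tmp/chain.$i.txt ) &
done
wait   # returns once all backgrounded jobs have finished

cat /tmp/chain.*.txt   # one line per finished "chain"
```

The same shape appears twice in the job script: once for the 12 sampling chains, once for stansummary and the diagnose run.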