Run multiple chains with cmdstan as batch jobs using qsub on a cluster

Operating System: linux cluster
Interface Version: cmdstan
Compiler/Toolkit: qsub, PBS

Just want to keep a record about the captioned in case others would need to do this.

I attempted to run multiple chains following p.28 of the cmdstan-guide-2.19.1.pdf but the chains would run sequentially.

Following @bbbales2’s suggestion, I now run the multiple chains in parallel as batch jobs on the cluster:

  • In the .sh shell script to be submitted with the qsub command, put down
#PBS -t 1-4

cd ~/cmdstan-2.19.1

time  ../SSM0.5.2dev sample algorithm=hmc metric=dense_e adapt delta=0.8 \
id=$PBS_ARRAYID data file=../SSM_data.dat output file=samples$PBS_ARRAYID.csv
  • The -t 1-4 option of the PBS command schedules an array of jobs (1 to 4). The different job numbers will be captured by the environment variable $PBS_ARRAYID as explained in this page.

  • The second line changes the directory to cmdstan directory that must be the location to use the sampling command in the third line

  • the inclusion of time in the third line keeps track of the wall time. I need to run the compiled version of the .stan file (in my case SSM0.5.2dev) located in my home directory, which is a level up from the cmdstan directory.

  • (In my case only, I need to use the metric=dense_e option, instead of the default choice metric=diag_e.)

  • The output file=samples$PBS_ARRAYID.csv specifies that the sampling output files would be samples1.csv, ... samples4.csv located in the cmdstan directory (a path could have been added to place the output files elsewhere).

Note: Different clusters use different approaches to specify an array of batch jobs. See this post for the case of another cluster. See also this reply to the post.

1 Like

what’s the reason to manually time it instead of using the csv output?

It’s been a while since I ran cmdstan on our cluster w/ multithreading, but when I was it looked like the timed outputs in the CSVs weren’t matching up with the actual walltime for each chain (CSV time was much longer than actual runtime), so I was using time to time them. Maybe OP is having a similar issue?

Indeed not necessary. The time part came from McElreath’s use of it in this post on a related topic. I didn’t realize the csv output also reports the info.

Do you happen to know whether the command

bin/stansummary samples*.csv

has an option to report only the summary for selected parameters like rstan's

print(fit, c("alpha", "beta", ... ))

?

I am struggling to find a convenient way to examine the output from cmdstan.

You could send the summary output to a text file and then grep for the interesting parts

bin/stansummary samples*.csv &> summary.txt
for var in foo bar theta gamma; do
  grep "^$var" summary.txt
done

For the sample CSV files,

# find column indices
i=$(grep "^#lp__" output.csv | tr , \\n | nl | grep alpha | cut -f1 | column -t | tr \\n ,)

# print lp and columns for columns matching alpha
grep -v "^#" output.csv | cut -f${i}1 -d,

When testing a new model, I’ve found it useful to save_warmup=1 and run watch with the command

 tail -n20 output.csv | grep -v '^#' | cut -d, -f1-6 | tr , '\t' | column -t

to see stepsize & treedepth etc live

1 Like

Thanks for your guidance. The following variation of your suggestion serves my purpose:

bin/stansummary samples*.csv --sig_figs=3 &> summary.txt
grep -P "Inference" summary.txt; \
grep -P "iterations saved" summary.txt; \
grep -P "Warmup" summary.txt; \
grep -P "Sampling" summary.txt; \
grep -P "Mean" summary.txt; for \
var in lp__ accept_stat__ stepsize__ treedepth__ n_leapfrog__ divergent__ energy__ \
sd_y mu_u1 mu_alpha beta theta sd_season mu_season p g w d; do \
  grep -P "^$var[[:space:]]|^$var\[" summary.txt;  
done

This reports:

Inference for Stan model: SSM0_5_2dev_model
4 chains: each with iter=(400,400,400,400); warmup=(0,0,0,0); thin=(1,1,1,1); 1600 iterations saved.
Warmup took (7133, 7125, 7429, 6727) seconds, 7.89 hours total
Sampling took (3443, 3584, 3725, 3661) seconds, 4.00 hours total
                         Mean      MCSE    StdDev         5%        50%        95%     N_Eff   N_Eff/s     R_hat
lp__                 2.07e+04  5.76e-01  1.39e+01   2.07e+04   2.07e+04   2.07e+04  5.82e+02  4.04e-02  1.00e+00
accept_stat__        9.62e-01  2.75e-03  8.83e-02   8.48e-01   9.88e-01   9.99e-01  1.03e+03  7.16e-02  1.01e+00
stepsize__           1.79e-02  6.50e-04  9.20e-04   1.69e-02   1.79e-02   1.94e-02  2.00e+00  1.39e-04  5.60e+13
treedepth__          9.00e+00  5.04e-03  1.98e-01   9.00e+00   9.00e+00   9.00e+00  1.55e+03  1.08e-01  1.00e+00
n_leapfrog__         5.46e+02  3.59e+00  1.34e+02   5.11e+02   5.11e+02   1.02e+03  1.39e+03  9.63e-02  1.02e+00
divergent__          5.00e-03      -nan  7.06e-02   0.00e+00   0.00e+00   0.00e+00      -nan      -nan  9.99e-01
energy__            -2.05e+04  8.28e-01  1.91e+01  -2.06e+04  -2.05e+04  -2.05e+04  5.30e+02  3.68e-02  1.01e+00
sd_y                 8.01e-02  1.35e-05  5.74e-04   7.91e-02   8.01e-02   8.10e-02  1.80e+03  1.25e-01  9.98e-01
mu_u1                1.02e-01  1.99e-04  8.99e-03   8.65e-02   1.02e-01   1.16e-01  2.03e+03  1.41e-01  1.00e+00
mu_alpha             4.17e-02  4.61e-05  1.86e-03   3.87e-02   4.18e-02   4.47e-02  1.63e+03  1.13e-01  9.99e-01
beta                 5.93e-01  1.87e-04  7.52e-03   5.80e-01   5.93e-01   6.05e-01  1.62e+03  1.12e-01  1.00e+00
theta                1.53e-01  7.81e-05  3.24e-03   1.48e-01   1.53e-01   1.59e-01  1.71e+03  1.19e-01  1.00e+00
sd_season            1.02e-01  1.03e-04  4.25e-03   9.46e-02   1.01e-01   1.09e-01  1.70e+03  1.18e-01  1.00e+00
mu_season[1]        -1.26e-01  2.23e-04  1.05e-02  -1.44e-01  -1.26e-01  -1.09e-01  2.20e+03  1.53e-01  1.00e+00
mu_season[2]        -7.06e-02  2.23e-04  1.02e-02  -8.78e-02  -7.04e-02  -5.45e-02  2.07e+03  1.44e-01  1.00e+00
mu_season[3]         1.46e-01  2.42e-04  1.03e-02   1.28e-01   1.46e-01   1.63e-01  1.82e+03  1.27e-01  1.00e+00
p[1]                 6.45e-01  1.07e-03  3.83e-02   5.98e-01   6.38e-01   7.19e-01  1.27e+03  8.81e-02  1.00e+00
p[2]                 6.27e-01  1.20e-04  5.35e-03   6.18e-01   6.27e-01   6.36e-01  1.98e+03  1.37e-01  9.99e-01
p[3]                 5.97e-01  1.54e-03  5.48e-02   5.30e-01   5.87e-01   7.04e-01  1.27e+03  8.79e-02  1.00e+00
g[1]                 8.36e-01  1.09e-03  4.36e-02   7.65e-01   8.36e-01   9.10e-01  1.62e+03  1.12e-01  1.00e+00
g[2]                 3.37e-01  5.20e-04  2.02e-02   3.05e-01   3.37e-01   3.71e-01  1.51e+03  1.05e-01  1.00e+00
w[1]                 7.33e-01  1.53e-03  6.31e-02   6.24e-01   7.34e-01   8.35e-01  1.70e+03  1.18e-01  9.99e-01
w[2]                 1.41e-01  3.11e-04  1.26e-02   1.20e-01   1.41e-01   1.62e-01  1.66e+03  1.15e-01  9.99e-01
w[3]                 5.94e-01  4.87e-04  1.95e-02   5.61e-01   5.94e-01   6.25e-01  1.60e+03  1.11e-01  9.99e-01
d[1]                 6.68e-02  2.74e-04  1.07e-02   4.91e-02   6.69e-02   8.42e-02  1.54e+03  1.07e-01  1.00e+00
d[2]                 6.98e-01  2.17e-04  9.15e-03   6.83e-01   6.98e-01   7.13e-01  1.78e+03  1.23e-01  9.99e-01
d[3]                 2.35e-01  2.62e-04  1.10e-02   2.17e-01   2.35e-01   2.53e-01  1.78e+03  1.23e-01  1.00e+00

just curious why you’d do that when grep "^$var" would find what you want? is it not specific enough?

I have a bunch of “derived” parameters due to non-centered reparametrization of correlated parameters, such as p_mu, p_sd, p_L, p_err for the original parameter p. My primary interest is in the original parameters. The variation lets me filter out those derived parameters.

Hello @maedoc,

Do you know a simple way that I can capture R_hat in two decimal places?

The command bin/stansummary samples*.csv &> summary.txt reports R_hat only in one decimal place (1.0e+00). But I understand that the threshold for R_hat to watch for should be 1.01.

Run stansummary without arguments and it explains how to change the number of significant digits. You can also dump to a csv file which is useful.

1 Like

Thanks for the info. Appreciated.