Run multiple chains with cmdstan as batch jobs using qsub on a cluster

Operating System: linux cluster
Interface Version: cmdstan
Compiler/Toolkit: qsub, PBS

Just want to keep a record about the captioned in case others would need to do this.

I attempted to run multiple chains following p.28 of the cmdstan-guide-2.19.1.pdf but the chains would run sequentially.

Following @bbbales2’s suggestion, I now run the multiple chains in parallel as batch jobs on the cluster:

  • In the .sh shell script to be submitted with the qsub command, put down
#PBS -t 1-4

cd ~/cmdstan-2.19.1

time  ../SSM0.5.2dev sample algorithm=hmc metric=dense_e adapt delta=0.8 \
id=$PBS_ARRAYID data file=../SSM_data.dat output file=samples$PBS_ARRAYID.csv
  • The -t 1-4 option of the PBS command schedules an array of jobs (1 to 4). The different job numbers will be captured by the environment variable $PBS_ARRAYID as explained in this page.

  • The second line changes the directory to cmdstan directory that must be the location to use the sampling command in the third line

  • the inclusion of time in the third line keeps track of the wall time. I need to run the compiled version of the .stan file (in my case SSM0.5.2dev) located in my home directory, which is a level up from the cmdstan directory.

  • (In my case only, I need to use the metric=dense_e option, instead of the default choice metric=diag_e.)

  • The output file=samples$PBS_ARRAYID.csv specifies that the sampling output files would be samples1.csv, ... samples4.csv located in the cmdstan directory (a path could have been added to place the output files elsewhere).

Note: Different clusters use different approaches to specify an array of batch jobs. See this post for the case of another cluster. See also this reply to the post.

what’s the reason to manually time it instead of using the csv output?

It’s been a while since I ran cmdstan on our cluster w/ multithreading, but when I was it looked like the timed outputs in the CSVs weren’t matching up with the actual walltime for each chain (CSV time was much longer than actual runtime), so I was using time to time them. Maybe OP is having a similar issue?

Indeed not necessary. The time part came from McElreath’s use of it in this post on a related topic. I didn’t realize the csv output also reports the info.

Do you happen to know whether the command

bin/stansummary samples*.csv

has an option to report only the summary for selected parameters like rstan's

print(fit, c("alpha", "beta", ... ))

?

I am struggling to find a convenient way to examine the output from cmdstan.

You could send the summary output to a text file and then grep for the interesting parts

bin/stansummary samples*.csv &> summary.txt
for var in foo bar theta gamma; do
  grep "^$var" summary.txt
done

For the sample CSV files,

# find column indices
i=$(grep "^#lp__" output.csv | tr , \\n | nl | grep alpha | cut -f1 | column -t | tr \\n ,)

# print lp and columns for columns matching alpha
grep -v "^#" output.csv | cut -f${i}1 -d,

When testing a new model, I’ve found it useful to save_warmup=1 and run watch with the command

 tail -n20 output.csv | grep -v '^#' | cut -d, -f1-6 | tr , '\t' | column -t

to see stepsize & treedepth etc live

1 Like

Thanks for your guidance. The following variation of your suggestion serves my purpose:

bin/stansummary samples*.csv --sig_figs=3 &> summary.txt
grep -P "Inference" summary.txt; \
grep -P "iterations saved" summary.txt; \
grep -P "Warmup" summary.txt; \
grep -P "Sampling" summary.txt; \
grep -P "Mean" summary.txt; for \
var in lp__ accept_stat__ stepsize__ treedepth__ n_leapfrog__ divergent__ energy__ \
sd_y mu_u1 mu_alpha beta theta sd_season mu_season p g w d; do \
  grep -P "^$var[[:space:]]|^$var\[" summary.txt;  
done

This reports:

Inference for Stan model: SSM0_5_2dev_model
4 chains: each with iter=(400,400,400,400); warmup=(0,0,0,0); thin=(1,1,1,1); 1600 iterations saved.
Warmup took (7133, 7125, 7429, 6727) seconds, 7.89 hours total
Sampling took (3443, 3584, 3725, 3661) seconds, 4.00 hours total
                         Mean      MCSE    StdDev         5%        50%        95%     N_Eff   N_Eff/s     R_hat
lp__                 2.07e+04  5.76e-01  1.39e+01   2.07e+04   2.07e+04   2.07e+04  5.82e+02  4.04e-02  1.00e+00
accept_stat__        9.62e-01  2.75e-03  8.83e-02   8.48e-01   9.88e-01   9.99e-01  1.03e+03  7.16e-02  1.01e+00
stepsize__           1.79e-02  6.50e-04  9.20e-04   1.69e-02   1.79e-02   1.94e-02  2.00e+00  1.39e-04  5.60e+13
treedepth__          9.00e+00  5.04e-03  1.98e-01   9.00e+00   9.00e+00   9.00e+00  1.55e+03  1.08e-01  1.00e+00
n_leapfrog__         5.46e+02  3.59e+00  1.34e+02   5.11e+02   5.11e+02   1.02e+03  1.39e+03  9.63e-02  1.02e+00
divergent__          5.00e-03      -nan  7.06e-02   0.00e+00   0.00e+00   0.00e+00      -nan      -nan  9.99e-01
energy__            -2.05e+04  8.28e-01  1.91e+01  -2.06e+04  -2.05e+04  -2.05e+04  5.30e+02  3.68e-02  1.01e+00
sd_y                 8.01e-02  1.35e-05  5.74e-04   7.91e-02   8.01e-02   8.10e-02  1.80e+03  1.25e-01  9.98e-01
mu_u1                1.02e-01  1.99e-04  8.99e-03   8.65e-02   1.02e-01   1.16e-01  2.03e+03  1.41e-01  1.00e+00
mu_alpha             4.17e-02  4.61e-05  1.86e-03   3.87e-02   4.18e-02   4.47e-02  1.63e+03  1.13e-01  9.99e-01
beta                 5.93e-01  1.87e-04  7.52e-03   5.80e-01   5.93e-01   6.05e-01  1.62e+03  1.12e-01  1.00e+00
theta                1.53e-01  7.81e-05  3.24e-03   1.48e-01   1.53e-01   1.59e-01  1.71e+03  1.19e-01  1.00e+00
sd_season            1.02e-01  1.03e-04  4.25e-03   9.46e-02   1.01e-01   1.09e-01  1.70e+03  1.18e-01  1.00e+00
mu_season[1]        -1.26e-01  2.23e-04  1.05e-02  -1.44e-01  -1.26e-01  -1.09e-01  2.20e+03  1.53e-01  1.00e+00
mu_season[2]        -7.06e-02  2.23e-04  1.02e-02  -8.78e-02  -7.04e-02  -5.45e-02  2.07e+03  1.44e-01  1.00e+00
mu_season[3]         1.46e-01  2.42e-04  1.03e-02   1.28e-01   1.46e-01   1.63e-01  1.82e+03  1.27e-01  1.00e+00
p[1]                 6.45e-01  1.07e-03  3.83e-02   5.98e-01   6.38e-01   7.19e-01  1.27e+03  8.81e-02  1.00e+00
p[2]                 6.27e-01  1.20e-04  5.35e-03   6.18e-01   6.27e-01   6.36e-01  1.98e+03  1.37e-01  9.99e-01
p[3]                 5.97e-01  1.54e-03  5.48e-02   5.30e-01   5.87e-01   7.04e-01  1.27e+03  8.79e-02  1.00e+00
g[1]                 8.36e-01  1.09e-03  4.36e-02   7.65e-01   8.36e-01   9.10e-01  1.62e+03  1.12e-01  1.00e+00
g[2]                 3.37e-01  5.20e-04  2.02e-02   3.05e-01   3.37e-01   3.71e-01  1.51e+03  1.05e-01  1.00e+00
w[1]                 7.33e-01  1.53e-03  6.31e-02   6.24e-01   7.34e-01   8.35e-01  1.70e+03  1.18e-01  9.99e-01
w[2]                 1.41e-01  3.11e-04  1.26e-02   1.20e-01   1.41e-01   1.62e-01  1.66e+03  1.15e-01  9.99e-01
w[3]                 5.94e-01  4.87e-04  1.95e-02   5.61e-01   5.94e-01   6.25e-01  1.60e+03  1.11e-01  9.99e-01
d[1]                 6.68e-02  2.74e-04  1.07e-02   4.91e-02   6.69e-02   8.42e-02  1.54e+03  1.07e-01  1.00e+00
d[2]                 6.98e-01  2.17e-04  9.15e-03   6.83e-01   6.98e-01   7.13e-01  1.78e+03  1.23e-01  9.99e-01
d[3]                 2.35e-01  2.62e-04  1.10e-02   2.17e-01   2.35e-01   2.53e-01  1.78e+03  1.23e-01  1.00e+00

just curious why you’d do that when grep "^$var" would find what you want? is it not specific enough?

I have a bunch of “derived” parameters due to non-centered reparametrization of correlated parameters, such as p_mu, p_sd, p_L, p_err for the original parameter p. My primary interest is in the original parameters. The variation lets me filter out those derived parameters.

Hello @maedoc,

Do you know a simple way that I can capture R_hat in two decimal places?

The command bin/stansummary samples*.csv &> summary.txt reports R_hat only in one decimal place (1.0e+00). But I understand that the threshold for R_hat to watch for should be 1.01.

Run stansummary without arguments and it explains how to change the number of significant digits. You can also dump to a csv file which is useful.

1 Like

Thanks for the info. Appreciated.