Run multiple chains with cmdstan as batch jobs using qsub on a cluster

tlyim · June 2, 2019, 1:44pm

Operating System: linux cluster
Interface Version: cmdstan
Compiler/Toolkit: qsub, PBS

I am posting this as a record in case others need to do the same.

I tried to run multiple chains following p. 28 of cmdstan-guide-2.19.1.pdf, but the chains ran sequentially.

Following @bbbales2’s suggestion, I now run the multiple chains in parallel as batch jobs on the cluster:

In the .sh shell script to be submitted with the qsub command, put:

#PBS -t 1-4

cd ~/cmdstan-2.19.1

time  ../SSM0.5.2dev sample algorithm=hmc metric=dense_e adapt delta=0.8 \
id=$PBS_ARRAYID data file=../SSM_data.dat output file=samples$PBS_ARRAYID.csv

The -t 1-4 option of the PBS directive schedules an array of four jobs (1 to 4). Each job's index is available in the environment variable $PBS_ARRAYID, as explained in this page.
The second line changes to the cmdstan directory, which is where the sampling command in the third line is run from.
The time prefix in the third line records the wall time. The command runs the compiled version of the .stan file (in my case SSM0.5.2dev), which is located in my home directory, one level up from the cmdstan directory.
(In my case only, I need the metric=dense_e option instead of the default metric=diag_e.)
The output file=samples$PBS_ARRAYID.csv argument writes the sampling output to samples1.csv, ..., samples4.csv in the cmdstan directory (a path could be added to place the output files elsewhere).
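
Since each array job writes its own samplesN.csv, it can help to confirm that every chain produced a file before summarizing. The following is only a sketch: the function name missing_chains and its arguments are hypothetical, and it assumes the samples<N>.csv naming used above.

```shell
# List the expected per-chain output files that are missing or empty.
# $1 = directory holding the output files, $2 = number of chains.
# (missing_chains is a hypothetical helper, not part of cmdstan.)
missing_chains() {
  i=1
  while [ "$i" -le "$2" ]; do
    f="$1/samples$i.csv"
    # -s is true only if the file exists and has a size greater than zero
    [ -s "$f" ] || echo "$f"
    i=$((i + 1))
  done
}
```

For example, missing_chains ~/cmdstan-2.19.1 4 prints nothing when all four files are present and non-empty.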

Note: Different clusters use different approaches to specify an array of batch jobs. See this post for the case of another cluster. See also this reply to the post.
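
As a rough illustration of how the index variable differs between schedulers, here is a sketch that returns whichever one is set. The variable names are the ones commonly documented for Torque (PBS_ARRAYID), PBS Pro (PBS_ARRAY_INDEX), Slurm (SLURM_ARRAY_TASK_ID) and SGE (SGE_TASK_ID); the function name chain_id and the fallback value of 1 are assumptions made for this example, so check your own cluster's documentation.

```shell
# Print the array-job index from whichever scheduler variable is set.
# Falls back to 1 outside an array job (an assumption, handy for local tests).
chain_id() {
  for v in "${PBS_ARRAYID:-}" "${PBS_ARRAY_INDEX:-}" \
           "${SLURM_ARRAY_TASK_ID:-}" "${SGE_TASK_ID:-}"; do
    # SGE reports the literal string "undefined" for non-array jobs
    if [ -n "$v" ] && [ "$v" != "undefined" ]; then
      echo "$v"
      return 0
    fi
  done
  echo 1
}
```

In the job script, id=$(chain_id) could then stand in for $PBS_ARRAYID in both the id= argument and the output file name.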

yizhang · June 2, 2019, 11:06pm

What's the reason for timing it manually instead of using the csv output?

tviti · June 3, 2019, 12:13am

It's been a while since I ran cmdstan on our cluster with multithreading, but when I did, the times reported in the CSVs didn't match the actual walltime for each chain (the CSV time was much longer than the actual runtime), so I used time to time them. Maybe OP is having a similar issue?

tlyim · June 3, 2019, 12:15am

Indeed not necessary. The time part came from McElreath’s use of it in this post on a related topic. I didn’t realize the csv output also reports the info.

Do you happen to know whether the command

bin/stansummary samples*.csv

has an option to report only the summary for selected parameters like rstan's

print(fit, c("alpha", "beta", ... ))

?

I am struggling to find a convenient way to examine the output from cmdstan.

maedoc · June 3, 2019, 1:30pm

You could send the summary output to a text file and then grep for the interesting parts

bin/stansummary samples*.csv &> summary.txt
for var in foo bar theta gamma; do
  grep "^$var" summary.txt
done

For the sample CSV files,

# find column indices (the header row starts with lp__ and is not a # comment line)
i=$(grep "^lp__" output.csv | tr , \\n | nl | grep alpha | cut -f1 | column -t | tr \\n ,)

# print lp__ (column 1) and the columns whose names match alpha
grep -v "^#" output.csv | cut -f${i}1 -d,
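
The same column selection can also be sketched in a single awk pass, which avoids building the index list by hand. This is an alternative to the pipeline above, not something from cmdstan itself; the function name stan_cols is hypothetical, and it assumes the first non-comment line of the file is the header row.

```shell
# Print the columns of a Stan CSV whose header name matches a regex.
# $1 = extended regex on column names, $2 = CSV file.
# (stan_cols is a hypothetical helper name.)
stan_cols() {
  awk -F, -v pat="$1" '
    /^#/ || NF == 0 { next }        # skip comment and blank lines
    !seen {                         # first remaining line is the header
      for (i = 1; i <= NF; i++) if ($i ~ pat) keep[++n] = i
      seen = 1
    }
    {                               # print kept columns (header included)
      line = ""
      for (k = 1; k <= n; k++) line = line (k > 1 ? "," : "") $(keep[k])
      print line
    }
  ' "$2"
}
```

For instance, stan_cols '^(lp__|alpha)' output.csv keeps lp__ plus every column whose name starts with alpha.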

When testing a new model, I've found it useful to set save_warmup=1 and run watch with the command

 tail -n20 output.csv | grep -v '^#' | cut -d, -f1-6 | tr , '\t' | column -t

to see stepsize & treedepth etc live

tlyim · June 4, 2019, 2:53am

Thanks for your guidance. The following variation of your suggestion serves my purpose:

bin/stansummary samples*.csv --sig_figs=3 &> summary.txt
grep -P "Inference" summary.txt
grep -P "iterations saved" summary.txt
grep -P "Warmup" summary.txt
grep -P "Sampling" summary.txt
grep -P "Mean" summary.txt
for var in lp__ accept_stat__ stepsize__ treedepth__ n_leapfrog__ divergent__ energy__ \
           sd_y mu_u1 mu_alpha beta theta sd_season mu_season p g w d; do
  grep -P "^$var[[:space:]]|^$var\[" summary.txt
done

This reports:

Inference for Stan model: SSM0_5_2dev_model
4 chains: each with iter=(400,400,400,400); warmup=(0,0,0,0); thin=(1,1,1,1); 1600 iterations saved.
Warmup took (7133, 7125, 7429, 6727) seconds, 7.89 hours total
Sampling took (3443, 3584, 3725, 3661) seconds, 4.00 hours total
                         Mean      MCSE    StdDev         5%        50%        95%     N_Eff   N_Eff/s     R_hat
lp__                 2.07e+04  5.76e-01  1.39e+01   2.07e+04   2.07e+04   2.07e+04  5.82e+02  4.04e-02  1.00e+00
accept_stat__        9.62e-01  2.75e-03  8.83e-02   8.48e-01   9.88e-01   9.99e-01  1.03e+03  7.16e-02  1.01e+00
stepsize__           1.79e-02  6.50e-04  9.20e-04   1.69e-02   1.79e-02   1.94e-02  2.00e+00  1.39e-04  5.60e+13
treedepth__          9.00e+00  5.04e-03  1.98e-01   9.00e+00   9.00e+00   9.00e+00  1.55e+03  1.08e-01  1.00e+00
n_leapfrog__         5.46e+02  3.59e+00  1.34e+02   5.11e+02   5.11e+02   1.02e+03  1.39e+03  9.63e-02  1.02e+00
divergent__          5.00e-03      -nan  7.06e-02   0.00e+00   0.00e+00   0.00e+00      -nan      -nan  9.99e-01
energy__            -2.05e+04  8.28e-01  1.91e+01  -2.06e+04  -2.05e+04  -2.05e+04  5.30e+02  3.68e-02  1.01e+00
sd_y                 8.01e-02  1.35e-05  5.74e-04   7.91e-02   8.01e-02   8.10e-02  1.80e+03  1.25e-01  9.98e-01
mu_u1                1.02e-01  1.99e-04  8.99e-03   8.65e-02   1.02e-01   1.16e-01  2.03e+03  1.41e-01  1.00e+00
mu_alpha             4.17e-02  4.61e-05  1.86e-03   3.87e-02   4.18e-02   4.47e-02  1.63e+03  1.13e-01  9.99e-01
beta                 5.93e-01  1.87e-04  7.52e-03   5.80e-01   5.93e-01   6.05e-01  1.62e+03  1.12e-01  1.00e+00
theta                1.53e-01  7.81e-05  3.24e-03   1.48e-01   1.53e-01   1.59e-01  1.71e+03  1.19e-01  1.00e+00
sd_season            1.02e-01  1.03e-04  4.25e-03   9.46e-02   1.01e-01   1.09e-01  1.70e+03  1.18e-01  1.00e+00
mu_season[1]        -1.26e-01  2.23e-04  1.05e-02  -1.44e-01  -1.26e-01  -1.09e-01  2.20e+03  1.53e-01  1.00e+00
mu_season[2]        -7.06e-02  2.23e-04  1.02e-02  -8.78e-02  -7.04e-02  -5.45e-02  2.07e+03  1.44e-01  1.00e+00
mu_season[3]         1.46e-01  2.42e-04  1.03e-02   1.28e-01   1.46e-01   1.63e-01  1.82e+03  1.27e-01  1.00e+00
p[1]                 6.45e-01  1.07e-03  3.83e-02   5.98e-01   6.38e-01   7.19e-01  1.27e+03  8.81e-02  1.00e+00
p[2]                 6.27e-01  1.20e-04  5.35e-03   6.18e-01   6.27e-01   6.36e-01  1.98e+03  1.37e-01  9.99e-01
p[3]                 5.97e-01  1.54e-03  5.48e-02   5.30e-01   5.87e-01   7.04e-01  1.27e+03  8.79e-02  1.00e+00
g[1]                 8.36e-01  1.09e-03  4.36e-02   7.65e-01   8.36e-01   9.10e-01  1.62e+03  1.12e-01  1.00e+00
g[2]                 3.37e-01  5.20e-04  2.02e-02   3.05e-01   3.37e-01   3.71e-01  1.51e+03  1.05e-01  1.00e+00
w[1]                 7.33e-01  1.53e-03  6.31e-02   6.24e-01   7.34e-01   8.35e-01  1.70e+03  1.18e-01  9.99e-01
w[2]                 1.41e-01  3.11e-04  1.26e-02   1.20e-01   1.41e-01   1.62e-01  1.66e+03  1.15e-01  9.99e-01
w[3]                 5.94e-01  4.87e-04  1.95e-02   5.61e-01   5.94e-01   6.25e-01  1.60e+03  1.11e-01  9.99e-01
d[1]                 6.68e-02  2.74e-04  1.07e-02   4.91e-02   6.69e-02   8.42e-02  1.54e+03  1.07e-01  1.00e+00
d[2]                 6.98e-01  2.17e-04  9.15e-03   6.83e-01   6.98e-01   7.13e-01  1.78e+03  1.23e-01  9.99e-01
d[3]                 2.35e-01  2.62e-04  1.10e-02   2.17e-01   2.35e-01   2.53e-01  1.78e+03  1.23e-01  1.00e+00

maedoc · June 4, 2019, 8:03am

Just curious why you'd do that when grep "^$var" would find what you want? Is it not specific enough?

tlyim · June 4, 2019, 2:47pm

I have a bunch of “derived” parameters due to non-centered reparametrization of correlated parameters, such as p_mu, p_sd, p_L, p_err for the original parameter p. My primary interest is in the original parameters. The variation lets me filter out those derived parameters.
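
To illustrate the difference, here is a small sketch of the anchored match. It uses grep -E rather than grep -P (a substitution made here for portability), and the function name rows_for is hypothetical.

```shell
# Keep summary rows for exactly the named parameter: the scalar row "p ..."
# or indexed rows "p[1] ...", but not look-alikes such as "p_mu".
# $1 = parameter name, $2 = summary text file.
rows_for() {
  grep -E "^$1([[:space:]]|\[)" "$2"
}
```

With a plain grep "^p", rows such as p_mu and p_sd would also be printed, because the pattern only anchors the start of the name.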

tlyim · June 7, 2019, 9:17pm

Hello @maedoc,

Do you know a simple way to report R_hat to two decimal places?

The command bin/stansummary samples*.csv &> summary.txt reports R_hat to only one decimal place (1.0e+00). But I understand that the threshold to watch for with R_hat is 1.01.

maedoc · June 8, 2019, 9:21am

Run stansummary without arguments and it explains how to change the number of significant digits. You can also dump to a csv file which is useful.
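
Once the summary has enough significant figures (for example with --sig_figs=3 as used earlier in this thread), the rows above a threshold can be picked out with awk. This is only a sketch: the function name rhat_above is hypothetical, and it assumes R_hat is the last column of each row, as in the output shown above.

```shell
# Print the name and R_hat of every summary row whose last field exceeds
# a threshold. $1 = threshold, $2 = summary text file.
# Non-numeric last fields (header and timing lines) evaluate to 0 and are skipped.
rhat_above() {
  awk -v thr="$1" '$NF + 0 > thr + 0 { print $1, $NF }' "$2"
}
```

For example, rhat_above 1.01 summary.txt would list the parameters to look at more closely. Sampler columns such as stepsize__ can show a huge R_hat and will be listed as well, so they may need to be ignored.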

tlyim · June 8, 2019, 1:44pm

Thanks for the info. Appreciated.
