Segmentation fault: `cmdstan::command` can not parallel using `std::thread`

Hi all,

I have tried a simple model with two different data files. cmdstan::command succeeds in generating two output files if they are called sequentially. However, it will exit with a Segmentation fault error if parallelized using std::thread. Please find the test case as below.

The model stan file, with the file name bernoulli.stan.

data { 
  int<lower=0> N; 
  int<lower=0,upper=1> y[N];
} 
parameters {
  real<lower=0,upper=1> theta;
} 
model {
  theta ~ beta(1,1);
  for (n in 1:N) 
    y[n] ~ bernoulli(theta);
}

Compile the *hpp file

$ stanc --o=bernoulli.hpp bernoulli.stan

Two data files.

$ cat data1.R 
N <- 10
y <- c(0,1,0,0,0,0,0,0,0,1)

$ cat data2.R 
N <- 10
y <- c(1,1,1,0,0,0,0,1,0,1)

The first main.cpp with sequential calling

#include <stan/services/error_codes.hpp>
#include <cmdstan/command.hpp>
#include <boost/exception/diagnostic_information.hpp>
#include <boost/exception_ptr.hpp>

int main(int argc, const char* argv[]) {
  try {
		int arc = 8;
		const char * arr1 [] =	{"PROGRAM", "sample", "data", "file=data1.R", "random", "seed=123", "output", "file=output1.txt"};
		const char * arr2 [] =	{"PROGRAM", "sample", "data", "file=data2.R", "random", "seed=123", "output", "file=output2.txt"};
		cmdstan::command(arc, arr1);
		cmdstan::command(arc, arr2);
    return 0;
  } catch (const std::exception& e) {
    std::cout << e.what() << std::endl;
    return stan::services::error_codes::SOFTWARE;
  }
}

Program build comand

$ root=cmdstan_root
$ g++ -std=c++1y -pthread -Wno-sign-compare -O3 -x c++ \
	-I "$root/src" \
	-I "$root/stan/src" \
	-I "$root/stan/lib/stan_math/" \
	-I "$root/stan/lib/stan_math/lib/eigen_3.3.3" \
	-I "$root/stan/lib/stan_math/lib/boost_1.69.0" \
	-I "$root/stan/lib/stan_math/lib/sundials_4.1.0/include" \
	-L "$root/stan/lib/stan_math/lib/sundials_4.1.0/lib" \
	-DBOOST_RESULT_OF_USE_TR1 \
	-DBOOST_NO_DECLTYPE \
	-DBOOST_DISABLE_ASSERTS \
	-DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION \
	-lsundials_nvecserial \
	-lsundials_nvecserial \
	-lsundials_cvodes \
	-lsundials_idas \
	bernoulli.hpp main.cpp \
	-o bernoulli

Corrent output files with the sequential calling

$ stansummary output1.txt 
Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (0.0091) seconds, 0.0091 seconds total
Sampling took (0.014) seconds, 0.014 seconds total

                Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s    R_hat
lp__            -7.3  3.6e-02  7.5e-01   -8.8  -7.0  -6.8    436    30888  1.0e+00
accept_stat__   0.86  7.2e-03  2.1e-01   0.37  0.96   1.0    859    60888  1.0e+00
stepsize__       1.1  4.0e-15  4.0e-15    1.1   1.1   1.1    1.0       71  1.0e+00
treedepth__      1.4  1.6e-02  5.0e-01    1.0   1.0   2.0   1013    71761  1.0e+00
n_leapfrog__     2.2  3.3e-02  9.8e-01    1.0   3.0   3.0    871    61698  1.0e+00
divergent__     0.00  0.0e+00  0.0e+00   0.00  0.00  0.00    500    35433     -nan
energy__         7.8  5.2e-02  1.1e+00    6.8   7.5   9.9    415    29425  1.0e+00
theta           0.25  5.4e-03  1.2e-01  0.080  0.24  0.46    484    34286  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

$ stansummary output2.txt 
Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (0.0079) seconds, 0.0079 seconds total
Sampling took (0.013) seconds, 0.013 seconds total

                Mean     MCSE   StdDev    5%   50%   95%  N_Eff  N_Eff/s    R_hat
lp__            -8.8  4.0e-02  6.7e-01   -10  -8.6  -8.3    285    22163  1.0e+00
accept_stat__   0.92  3.4e-03  1.2e-01  0.67  0.97   1.0   1222    95188  1.0e+00
stepsize__       1.0  2.9e-15  2.9e-15   1.0   1.0   1.0    1.0       78  1.0e+00
treedepth__      1.4  1.7e-02  4.8e-01   1.0   1.0   2.0    838    65249  1.0e+00
n_leapfrog__     2.3  3.6e-02  9.7e-01   1.0   3.0   3.0    737    57421  1.0e+00
divergent__     0.00  0.0e+00  0.0e+00  0.00  0.00  0.00    500    38941     -nan
energy__         9.3  5.3e-02  9.3e-01   8.4   9.0    11    305    23733  1.0e+00
theta           0.51  7.2e-03  1.4e-01  0.28  0.52  0.74    363    28296  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

The second main.cpp with parallel calling using std::thread

#include <stan/services/error_codes.hpp>
#include <cmdstan/command.hpp>
#include <boost/exception/diagnostic_information.hpp>
#include <boost/exception_ptr.hpp>

#include <thread>

void worker(int argc, const char * argv[])
{
	cmdstan::command(argc, argv);
}

int main(int argc, const char* argv[]) {
	try {
		int arc = 8;
		const char * arr1 [] =	{"PROGRAM", "sample", "data", "file=data1.R", "random", "seed=123", "output", "file=output1.txt"};
		const char * arr2 [] =	{"PROGRAM", "sample", "data", "file=data2.R", "random", "seed=123", "output", "file=output2.txt"};

		int num_threads=2;

		std::thread t[num_threads];

		t[0] = std::thread(worker, arc, arr1);
		t[1] = std::thread(worker, arc, arr2);

		t[0].join();
		t[1].join();

		return 0;
	} catch (const std::exception& e) {
		std::cout << e.what() << std::endl;
		return stan::services::error_codes::SOFTWARE;
	}
}

Here, two threads are created for parallel model calling. However, it encounters a segmentation fault error.

$ ./bernoulli 
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file = data2.R
init = 2 (Default)
random
  seed = 123
output
  file = output2.txt
  diagnostic_file =  (Default)
  refresh = 100 (Default)

      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file = data1.R
init = 2 (Default)
random
  seed = 123
output
  file = output1.txt
  diagnostic_file =  (Default)
  refresh = 100 (Default)


Gradient evaluation took 1e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
Segmentation fault (core dumped)

The GDB trace info is as below.

$ gdb ./bernoulli 
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./bernoulli...done.
run
(gdb) run
Starting program: /home/lijin/work/mwe/cmdstanmwe/example/c++_thread/bernoulli/thread/bernoulli 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6e85700 (LWP 23060)]
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file = data1.R
init = 2 (Default)
random
  seed = 123
output
  file = output1.txt
  diagnostic_file =  (Default)
  refresh = 100 (Default)

[New Thread 0x7ffff6684700 (LWP 23061)]

Gradient evaluation took 1e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file = data2.R
init = 2 (Default)
random
  seed = 123
output
  file = output2.txt
  diagnostic_file =  (Default)
  refresh = 100 (Default)


Thread 2 "bernoulli" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6e85700 (LWP 23060)]
0x000055555556d7c2 in stan::math::precomputed_gradients_vari::chain (this=0x5555558703b8) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72
72	      varis_[i]->adj_ += adj_ * gradients_[i];
(gdb) where
#0  0x000055555556d7c2 in stan::math::precomputed_gradients_vari::chain (this=0x5555558703b8) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72
#1  0x00005555555ddc3a in stan::math::grad (vi=<optimized out>) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/lib/stan_math/stan/math/rev/core/grad.hpp:44
#2  stan::math::gradient<stan::model::model_functional<stan::model::model_base> > (f=..., x=..., fx=@0x7ffff6e83110: -4.6809100078512351, grad_fx=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/lib/stan_math/stan/math/rev/mat/functor/gradient.hpp:50
#3  0x00005555555de17e in stan::model::gradient<stan::model::model_base> (model=..., x=..., f=@0x7ffff6e83110: -4.6809100078512351, grad_f=..., logger=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/model/gradient.hpp:31
#4  0x00005555555de36e in stan::mcmc::base_hamiltonian<stan::model::model_base, stan::mcmc::diag_e_point, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::update_potential_gradient (this=0x7ffff6e83140, z=..., logger=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/hamiltonians/base_hamiltonian.hpp:69
#5  0x00005555555e3e89 in stan::mcmc::expl_leapfrog<stan::mcmc::diag_e_metric<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > >::update_q (logger=..., epsilon=-1.1153728433577428, hamiltonian=..., z=..., this=0x7ffff6e83138)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/integrators/expl_leapfrog.hpp:27
#6  stan::mcmc::base_leapfrog<stan::mcmc::diag_e_metric<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > >::evolve (this=0x7ffff6e83138, z=..., hamiltonian=warning: RTTI symbol not found for class 'stan::mcmc::diag_e_metric<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >'
..., epsilon=-1.1153728433577428, logger=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/integrators/base_leapfrog.hpp:24
#7  0x000055555560b73b in stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_e_metric, stan::mcmc::expl_leapfrog, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::build_tree (this=0x7ffff6e830e0, depth=<optimized out>, z_propose=..., p_sharp_left=..., 
    p_sharp_right=..., rho=..., H0=7.1728643944582391, sign=-1, n_leapfrog=@0x7ffff6e82a24: 0, log_sum_weight=@0x7ffff6e82a80: -inf, sum_metro_prob=@0x7ffff6e82a28: 0, logger=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/nuts/base_nuts.hpp:215
#8  0x000055555560c361 in stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_e_metric, stan::mcmc::expl_leapfrog, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::transition (this=0x7ffff6e830e0, init_sample=..., logger=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/nuts/base_nuts.hpp:122
#9  0x000055555560c4b6 in stan::mcmc::adapt_diag_e_nuts<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::transition (this=<optimized out>, init_sample=..., logger=..., this=<optimized out>)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp:27
#10 0x000055555559134a in stan::services::util::generate_transitions<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > (sampler=warning: RTTI symbol not found for class 'stan::mcmc::adapt_diag_e_nuts<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >'
..., num_iterations=1000, finish=2000, num_thin=1, refresh=100, save=false, mcmc_writer=..., init_s=..., model=..., 
    base_rng=..., callback=..., logger=..., warmup=true, start=0) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/services/util/generate_transitions.hpp:71
#11 0x00005555555e7bf5 in stan::services::util::run_adaptive_sampler<stan::mcmc::adapt_diag_e_nuts<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >, stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > (diagnostic_writer=..., sample_writer=..., logger=..., interrupt=..., rng=..., 
    save_warmup=false, refresh=100, num_thin=1, num_samples=1000, num_warmup=1000, cont_vector=std::vector of length 1, capacity 1 = {...}, model=..., sampler=warning: RTTI symbol not found for class 'stan::mcmc::adapt_diag_e_nuts<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >'
..., this=<optimized out>)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/services/util/run_adaptive_sampler.hpp:69
#12 stan::services::sample::hmc_nuts_diag_e_adapt<stan::model::model_base> (model=..., init=..., init_inv_metric=..., random_seed=<optimized out>, chain=<optimized out>, init_radius=<optimized out>, 
    num_warmup=1000, num_samples=1000, num_thin=1, save_warmup=false, refresh=100, stepsize=<optimized out>, stepsize_jitter=<optimized out>, max_depth=10, delta=<optimized out>, gamma=<optimized out>, 
    kappa=<optimized out>, t0=<optimized out>, init_buffer=75, term_buffer=50, window=25, interrupt=..., logger=..., init_writer=..., sample_writer=..., diagnostic_writer=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/services/sample/hmc_nuts_diag_e_adapt.hpp:106
#13 0x0000555555598eca in stan::services::sample::hmc_nuts_diag_e_adapt<stan::model::model_base> (diagnostic_writer=..., sample_writer=..., init_writer=..., logger=..., interrupt=..., window=25, term_buffer=50, 
    init_buffer=75, t0=10, kappa=0.75, gamma=0.050000000000000003, delta=0.80000000000000004, max_depth=10, stepsize_jitter=0, stepsize=1, refresh=100, save_warmup=false, num_thin=1, num_samples=1000, 
    num_warmup=1000, init_radius=2, chain=0, random_seed=123, init=..., model=...) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/services/sample/hmc_nuts_diag_e_adapt.hpp:173
#14 cmdstan::command (argc=<optimized out>, argv=<optimized out>) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/src/cmdstan/command.hpp:523
#15 0x00007ffff7b0966f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007ffff727e6db in start_thread (arg=0x7ffff6e85700) at pthread_create.c:463
#17 0x00007ffff6fa788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

May I ask how to call cmdstan::command instances in parallel? Thank you.

  • Operating System: Ubuntu 18.04
  • CmdStan Version: 2.20.0
  • Compiler/Toolkit: g++ 7.4.0

I’m not surprised std::thread does not (yet) work in that context. The CmdStan users guide provides an example of simply launching different processes, like

for i in {1..4}
do
./eight_schools sample adapt delta=0.9 \
random seed=12345 id=$i data \
file=eight_schools.data.R \
output file=eight_schools$i.csv &
done

…and you would have to instantiate the ad tape in each thread. To do that, include in the worker function an instance of Stan::math::ChainablStack before calling cmdstan.

Thank you for the information. I am expecting parallel within one program instead of external parallel in a shell. Thank you anyway.

Thank you for the direction. I tried to add one Stan::math::ChainablStack instance before calling cmdstan. Yet, the segmentation fault persists using std::thread. I also tried to add a Stan::math::ChainablStack instance using std::async, the segfault error still persists. Please find the test case and the error message for the version using std::thread as below.

main.cpp

#include <stan/services/error_codes.hpp>
#include <cmdstan/command.hpp>
#include <boost/exception/diagnostic_information.hpp>
#include <boost/exception_ptr.hpp>

#include <stan/math/rev/mat.hpp>

#include <thread>

void worker(int argc, const char * argv[])
{
	stan::math::ChainableStack thread_instance;
	cmdstan::command(argc, argv);
}

int main(int argc, const char* argv[]) {
	try {
		int arc = 8;
		const char * arr1 [] =	{"PROGRAM", "sample", "data", "file=data1.R", "random", "seed=123", "output", "file=output1.txt"};
		const char * arr2 [] =	{"PROGRAM", "sample", "data", "file=data2.R", "random", "seed=123", "output", "file=output2.txt"};

		int num_threads=2;

		std::thread t[num_threads];

		t[0] = std::thread(worker, arc, arr1);
		t[1] = std::thread(worker, arc, arr2);

		t[0].join();
		t[1].join();

		return 0;
	} catch (const std::exception& e) {
		std::cout << e.what() << std::endl;
		return stan::services::error_codes::SOFTWARE;
	}
}

Error message

$ ./bernoulli 
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file = data2.R
init = 2 (Default)
random
  seed = 123
output
  file = output2.txt
  diagnostic_file =  (Default)
  refresh = 100 (Default)


Gradient evaluation took 1.1e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
Iteration:  100 / 2000 [  5%]  (Warmup)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
Iteration:  200 / 2000 [ 10%]  (Warmup)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
Iteration:  300 / 2000 [ 15%]  (Warmup)
            max_depth = 10 (Default)
        metric = diag_e (Default)
Iteration:  400 / 2000 [ 20%]  (Warmup)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file = data1.R
init = 2 (Default)
random
  seed = 123
Iteration:  500 / 2000 [ 25%]  (Warmup)
output
  file = output1.txt
  diagnostic_file =  (Default)
  refresh = 100 (Default)

Segmentation fault (core dumped)

The GDB trace info:

$ gdb ./bernoulli 
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./bernoulli...done.
(gdb) run
Starting program: ./bernoulli 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6e85700 (LWP 29432)]
[New Thread 0x7ffff6684700 (LWP 29433)]
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file = data1.R
init = 2 (Default)
random
  seed = 123
output
  file = output1.txt
  diagnostic_file =  (Default)
  refresh = 100 (Default)

method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
data
  file = data2.R
init = 2 (Default)
random
  seed = 123
output
  file = output2.txt
  diagnostic_file =  (Default)
  refresh = 100 (Default)


Gradient evaluation took 1.1e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
Adjust your expectations accordingly!


Iteration:    1 / 2000 [  0%]  (Warmup)

Gradient evaluation took 5e-06 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
Adjust your expectations accordingly!



Thread 2 "bernoulli" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6e85700 (LWP 29432)]
0x0000000000000000 in ?? ()
(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x00005555555dddaa in stan::math::grad (vi=<optimized out>) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/lib/stan_math/stan/math/rev/core/grad.hpp:44
#2  stan::math::gradient<stan::model::model_functional<stan::model::model_base> > (f=..., x=..., fx=@0x7ffff6e830d0: -28.573222549382347, grad_fx=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/lib/stan_math/stan/math/rev/mat/functor/gradient.hpp:50
#3  0x00005555555de2ee in stan::model::gradient<stan::model::model_base> (model=..., x=..., f=@0x7ffff6e830d0: -28.573222549382347, grad_f=..., logger=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/model/gradient.hpp:31
#4  0x00005555555de4de in stan::mcmc::base_hamiltonian<stan::model::model_base, stan::mcmc::diag_e_point, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::update_potential_gradient (
    this=0x7ffff6e83100, z=..., logger=...) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/hamiltonians/base_hamiltonian.hpp:69
#5  0x00005555555e3ff9 in stan::mcmc::expl_leapfrog<stan::mcmc::diag_e_metric<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > >::update_q (logger=..., epsilon=-2, 
    hamiltonian=..., z=..., this=0x7ffff6e830f8) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/integrators/expl_leapfrog.hpp:27
#6  stan::mcmc::base_leapfrog<stan::mcmc::diag_e_metric<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > >::evolve (this=0x7ffff6e830f8, z=..., hamiltonian=warning: RTTI symbol not found for class 'stan::mcmc::diag_e_metric<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >'
..., 
    epsilon=-2, logger=...) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/integrators/base_leapfrog.hpp:24
#7  0x000055555560b70b in stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_e_metric, stan::mcmc::expl_leapfrog, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::build_tree (
    this=0x7ffff6e830a0, depth=<optimized out>, z_propose=..., p_sharp_left=..., p_sharp_right=..., rho=..., H0=21.155288285582508, sign=-1, n_leapfrog=@0x7ffff6e829e4: 0, 
    log_sum_weight=@0x7ffff6e82a40: -inf, sum_metro_prob=@0x7ffff6e829e8: 0, logger=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/nuts/base_nuts.hpp:215
#8  0x000055555560c331 in stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_e_metric, stan::mcmc::expl_leapfrog, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::transition (
    this=0x7ffff6e830a0, init_sample=..., logger=...) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/nuts/base_nuts.hpp:122
#9  0x000055555560c486 in stan::mcmc::adapt_diag_e_nuts<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >::transition (this=<optimized out>, init_sample=..., 
    logger=..., this=<optimized out>) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp:27
#10 0x000055555559134a in stan::services::util::generate_transitions<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > (sampler=warning: RTTI symbol not found for class 'stan::mcmc::adapt_diag_e_nuts<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >'
..., num_iterations=1000, 
    finish=2000, num_thin=1, refresh=100, save=false, mcmc_writer=..., init_s=..., model=..., base_rng=..., callback=..., logger=..., warmup=true, start=0)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/services/util/generate_transitions.hpp:71
#11 0x00005555555e7d65 in stan::services::util::run_adaptive_sampler<stan::mcmc::adapt_diag_e_nuts<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >, stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > > (diagnostic_writer=..., sample_writer=..., logger=..., interrupt=..., rng=..., save_warmup=false, refresh=100, num_thin=1, 
    num_samples=1000, num_warmup=1000, cont_vector=std::vector of length 1, capacity 1 = {...}, model=..., sampler=warning: RTTI symbol not found for class 'stan::mcmc::adapt_diag_e_nuts<stan::model::model_base, boost::random::additive_combine_engine<boost::random::linear_congruential_engine<unsigned int, 40014u, 0u, 2147483563u>, boost::random::linear_congruential_engine<unsigned int, 40692u, 0u, 2147483399u> > >'
..., this=<optimized out>)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/services/util/run_adaptive_sampler.hpp:69
#12 stan::services::sample::hmc_nuts_diag_e_adapt<stan::model::model_base> (model=..., init=..., init_inv_metric=..., random_seed=<optimized out>, chain=<optimized out>, 
    init_radius=<optimized out>, num_warmup=1000, num_samples=1000, num_thin=1, save_warmup=false, refresh=100, stepsize=<optimized out>, stepsize_jitter=<optimized out>, 
    max_depth=10, delta=<optimized out>, gamma=<optimized out>, kappa=<optimized out>, t0=<optimized out>, init_buffer=75, term_buffer=50, window=25, interrupt=..., 
    logger=..., init_writer=..., sample_writer=..., diagnostic_writer=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/services/sample/hmc_nuts_diag_e_adapt.hpp:106
#13 0x0000555555598eca in stan::services::sample::hmc_nuts_diag_e_adapt<stan::model::model_base> (diagnostic_writer=..., sample_writer=..., init_writer=..., logger=..., 
    interrupt=..., window=25, term_buffer=50, init_buffer=75, t0=10, kappa=0.75, gamma=0.050000000000000003, delta=0.80000000000000004, max_depth=10, stepsize_jitter=0, 
    stepsize=1, refresh=100, save_warmup=false, num_thin=1, num_samples=1000, num_warmup=1000, init_radius=2, chain=0, random_seed=123, init=..., model=...)
    at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/stan/src/stan/services/sample/hmc_nuts_diag_e_adapt.hpp:173
#14 cmdstan::command (argc=<optimized out>, argv=<optimized out>) at /home/lijin/work/mwe/cmdstanmwe/install/cmdstan/src/cmdstan/command.hpp:523
#15 0x000055555559f287 in worker (argc=8, argv=0x7fffffffdab0) at main.cpp:13
#16 0x00007ffff7b0966f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x00007ffff727e6db in start_thread (arg=0x7ffff6e85700) at pthread_create.c:463
#18 0x00007ffff6fa788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Do you mind suggesting a potential solution to fix this segfault error? Thank you.

I’m going to change the tag of this to developer because it’s about low-level C++.

Stan itself needs to be set up with the right environment variables to build something threadsafe at the autodiff level. I’m not sure how to do that, but it must be in our doc somewhere. I don’t think there’s otherwise anything static going on that would cause issues within CmdStan itself.

If you’re trying to use the same model in multiple threads, that may cause problems at some point. I’m not sure what the design guarantees are there. @seantalts may know.

@Bob_Carpenter Hi Bob, thank you for your clarification. I am also feeling that I need to set up additional environment variables besides Stan::math::ChainablStack that @wds15 mentioned. However, I can not find relevant documents or examples. I also checked static variables in both CmdStan and Stan libraries, but I did not find clues for the segfault.

@seantalts Hi Sean, may I ask do you have any chance to help suggest what are necessary environment variables for multiple threads using the same model in CmdStan? Thank you.

Hey @lijinbio, we don’t support threading in Stan except via the use of map_rect, which does indeed need a compile flag to work in a threaded way. It may change the stack such that you can use it in the way you want to use it, but I’d recommend just going with the supported route instead if possible.

Richard McElreath put together a nice tutorial here: https://github.com/rmcelreath/cmdstan_map_rect_tutorial

Hi @seantalts, thank you for your quick reply. I have read this nice tutorial before. But I think I can not directly adapt it to my threading using the same model.

I have a program to run a single (and simple, can be finished in 0.1 second) model on over 20 million different datasets. To make the running possible, I need to implement parallelization so that we can use our many CPU cores. For now, I implemented it in the shell level, but it is too slow to be deployed as a software. So I hope I can call cmdstan::command function in parallel so that we can save much more running time compared to the shell level multiple threading.

Unfortunately, sounds like we don’t have a direct solution to the threading with the same model in Stan. May I ask what features or requirements prevents Stan from threading for the same model? Is it possible for me to meet these requirements to call cmdstan::commands in parallel? Thank you.

If you’re running the same model on a bunch of data sets, just spawn separate processes, right?

Hi @seantalts, yes, each data set is executed in a separate Stan process. My C++ program calls an external program in parallel, which implements the model by calling cmdstan::command. However, much time is wasted in the process fork and result collection in system calls. For example, only ~5 CPUs are occupied when the main C++ program has 50 threads in parallel. Therefore, I want to directly call cmdstan::command functions in parallel within the program. May I ask is it possible? Thank you.

We haven’t tested that - I wouldn’t trust that it works correctly. You can try the flags in the tutorial if you want but they aren’t intended for this use case.

Is this on Windows? Linux tends to have much cheaper fork() if that’s an option.

It may be annoying but I would say there is probably some way you could express this as smaller number of Stan models executions using map_rect. Can you share your model?

Hi @seantalts, I may try the flags in the tutorial. Our program is running in Linux.

Sure. Please find my model files as below.

$ cat betabinFit.stan 
data {
  int<lower=0> Jx; // amount of items
  int countx[Jx];
  int totalx[Jx];
}
parameters {
  real<lower=0,upper=1> mux;
  real<lower=0.001> ssx;
  real<lower=0,upper=1> probx[Jx];
}
model {
  mux ~ beta(1, 1);
  ssx ~ gamma(0.01, 0.01);
  probx ~ beta(mux*ssx, (1-mux)*ssx);
  countx ~ binomial(totalx, probx);
}

$ cat data.R 
Jx <- 3
countx <- c(2, 2, 2)
totalx <- c(4, 4, 4)

$ cat betabinFit.cpp 
int main() {
	int arc = 8;
	const char* arr[] = {"PROGRAM", "sample", "data", "file=data.R", "random", "seed=123", "output", "file=output.txt"};

	try {
		cmdstan::command(arc, arr);
		return 0;
	} catch (const std::exception& e) {
		std::cout << e.what() << std::endl;
		return stan::services::error_codes::SOFTWARE;
	}
	return 0;
}

I always want to understand how could we use map_rect to speed up the Stan model executions. Thank you for your help.

In fact, I also have several other questions w.r.t. the cmdstan::command interface.

  • If cmdstan::command in new Stan version (e.g. 2.20.0) supports input data from istringstream. This interface was implemented in very early versions, such as the interface int command(int argc, const char* argv[], std::istringstream & ifs) in the file stan/common/command.hpp of the Stan 2.6.0. But I didn’t find a similar interface in 2.20.0. Do we have some other ways to read input from C++ stream objects instead of a file in the hard drive?

  • How could I suppress info output in stdout in 2.20.0? When the task is to call cmdstan::command many times, these stdout information can be a burden, and I want to waive them.

I appreciate if you can write some quick answers. Sorry to bother if this topic is not good for cmdstan interface questions, and I may ask these questions in separate topics later. Many thanks for your help.

I don’t know the answers to your specific CmdStan questions, sorry! But as far as turning that into a map_rect model - I think it’s possible, and even easy if Jx is the same for every data set you have. If it’s not the same you’ll have to look into how to deal with raggedness, which can be a pain. Basically you can 0-pad and pass along indices for the end points. Though then you’d be fitting a “joint” model over all 20M data sets you have, but all of them should be independent so I’d expect that the performance could be better than parallelizing.

Actually, it might be easiest to give RStan or PyStan a shot - they load the model executable into memory and can fork a pool of worker processes to handle the data. That should get rid of any burden of forking.

Hi @seantalts, many thanks for your information. My datasets may have different Jx, but I can start from a subset with the same Jx. Then I may add the 0-paddings with all the data sets. It is great to learn to use the map_rect model in my case. If it works, I will not have a threading problem anymore. I will also test RStan to see if it will be faster than the C++ implementation in CmdStan. Thank you very much.

Given the way the model’s currently designed (to be immutable), it should technically be possible to use the same model in multiple threads. But, that would involve the same data. Every time you need to add new data, you need to create a new object.

The problem we have under all of this is autodiff, which is set up to be single threaded. If we configure map_rect to use threads, it makes the autodiff static memory thread local.

So it could potentially work, but we haven’t investigated this use case at all. I imagine what you really need is a single model with mutable data you can reuse, as that’ll avoid the I/O round trip, which is also expensive in bulk.