Parallel with cmdstanr?

Thanks for the awesome new interface for R users - so speedy!

I looked through the docs for instructions on running in parallel… cmdstanpy has the cores argument that I assume will be coming to cmdstanr too. Any guidance until then, or shall I just wait patiently?

3 Likes

Hey @nerutenbeck thanks for checking out cmdstanr!

You’re right that it’s not implemented yet. Definitely on the list of things that @rok_cesnovar and I will prioritize for an alpha or beta release sometime soon.

I haven’t tried this but I think the best option until then is something like this:

  • Use mclapply on Mac or Linux (maybe parLapply on Windows) from the parallel package (or use a different parallelization package for R) to run the same model 4 times in parallel

  • each element of the resulting list should be it’s own CmdStanMCMC object. Then make a new list where each element is a stanfit object instead of a CmdStanMCMC object (you can create a stanfit object from each of the CmdStanMCMC object separately using the technique at the very end of examples section here: Create a new CmdStanModel object — cmdstan_model • cmdstanr

  • use rstan::sflist2stanfit to convert the list of stanfit objects to a single one

  • use the resulting single stanfit object as you normally would if using rstan

That sounds like a lot but really is probably just a few lines of code.

The approach described above is not how we’ll implement parallelization in cmdstanr but it should work until we have an implementation. Once we add parallelization natively you’ll just get back a single CmdStanMCMC object containing all the chains and won’t need to deal with stanfit objects at all.

P.S. I think I told you this at StanCon, but in case I didn’t, I’m glad to see Stan getting used up there in Maine. I used to live there but it’s been a long time since I’ve been back!

5 Likes

Cool, cool. Thanks for your work!

Just as a follow-up for anybody who tries this - running the following works for me:

modfile <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
model <- cmdstan_model(modfile)
data_list <- list(N = 10, y =c(0,1,0,0,0,0,0,0,0,1))

fit_list <- parallel::mclapply(
  1:4,
  model$sample,
  data = data_list,
  num_chains = 1,
  num_samples = 1e3,
  mc.cores = 4
)

fit_sflist <- purrr::map(
  fit_list,
  ~ rstan::read_stan_csv(.$output_files())
)

fit <- rstan::sflist2stanfit(fit_sflist)

Though I’ve had some trouble with the same approach with models that I’ve written (possibly due to many generated quantities?). In any case I’m happy to wait for the native implementation.

I’m holding it down here between the mountains and the sea. It would be cool to get some more Stan action up here, though. I’m not aware of anybody local using Stan professionally other than me since Xulong Wang left Jackson Labs, but I would think there would (or should) be interest in a Stan course in Bar Harbor between JAX and MDIBL. Maybe it’s happening internally there and I’m just unaware of it - but if anybody reading this wanted to make something happen I’m happy to pitch in to help organize!

6 Likes

Just merged a PR from @rok_cesnovar implementing parallelization for CmdStanR! It can be installed from the master branch:

devtools::install_github("stan-dev/cmdstanr") 

If anyone tries this out we would definitely appreciate feedback, so please let us know how it goes. Thanks!

The details for controlling the number of chains/cores are here (see num_cores argument):

8 Likes

The code presented above did not consistently work for me. Sometimes it did, and sometimes it didn’t. With parallel::mclapply, it fails and prints

Warning message:
In parallel::mclapply(1:4, model$sample, data = data_list, num_chains = 1,  :
  all scheduled cores encountered errors in user code

I tried doParallel solutions, too, but with no luck. Same sort of issue, sometimes it works, sometimes it doesn’t. Here’s a reproducible example based off of the code above, but copied for completeness:

library(cmdstanr)
set_cmdstan_path("~/cmdstan")

modfile <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
model <- cmdstan_model(modfile)
data_list <- list(N = 10, y =c(0,1,0,0,0,0,0,0,0,1))

library(doParallel)
registerDoParallel(2)

fit_list <- foreach(i = 1:4) %dopar% {
  model$sample(data = data_list, 
               num_chains = 1,
               num_samples = 1e3)
}

The error message here is somehow even more vague

Error in { : task 1 failed - "invalid 'type' (character) of argument"

Both of these work if you remove the parallel bits, and both fail more often if you add more cores and/or more repetitions.

I’m optimistic we can get something like the following to work,

fit_list <- foreach(i = 1:100) %dopar% {
  lapply(1:4,
    model$sample,
    data = data_list,
    num_chains = 1,
    num_samples = 1e3
  )
}

Here’s my session in

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils    
[6] datasets  methods   base     

other attached packages:
[1] doParallel_1.0.15   iterators_1.0.12   
[3] foreach_1.5.0       cmdstanr_0.0.0.9000

loaded via a namespace (and not attached):
 [1] codetools_0.2-16 ps_1.3.2         crayon_1.3.4    
 [4] R6_2.4.1         lifecycle_0.2.0  jsonlite_1.6.1  
 [7] backports_1.1.6  magrittr_1.5     posterior_0.0.2 
[10] pillar_1.4.3     rlang_0.4.5      rstudioapi_0.11 
[13] vctrs_0.2.4      ellipsis_0.3.0   checkmate_2.0.0 
[16] tools_3.6.1      abind_1.4-5      compiler_3.6.1  
[19] processx_3.4.2   pkgconfig_2.0.3  tibble_3.0.1 

I’ve tried this with both CmdStan v2.22.1 and CmdStan v2.23.0.

Any thoughts?

Hi @roualdes,

is there a specific reason you need mcapply or equivalent?

You can just use the num_cores argument of $sample() for running chains in parallel. See the docs @jonah posted in the last post above.

Hi @rok_cesnovar, thanks for getting back to me.

I’m trying to compare HMC algorithms, and imagining something like conditional distributions of parameters, ess_bulk, ess_tail, and rhat, conditioned on the implementation of HMC. In the end, I’d like to have samples from various posteriors for two different versions of HMC where data and tuning parameters stay the same, but the algorithm implementation differs. My simulations so far have been attempting to replicate, say R = 100 times, a default run of Stan (4 chains, 2000 iterations, etc.)

Right, it’s theoretically possible to sample 4 * R chains using an many cores as I want, and then chunk the draws into blocks of 4. This is what I just attempted, but it mysteriously ended in a segfault after a few hours. Not yet sure why. Nonetheless, my only complaint against fit <- mod$sample(..., num_chains = 4*R, ...), and in favor of foreach(r = 1:R) %dopar% {mod$sample(...); ...}, is that at some point after storing fit, the call fit$draws() is really slow. In the foreach solution, I could wrap $draws() (and various other summary calculations) into the parallel for loop.

There’s sufficient way to work around this, but I thought if I stumbled across this issue maybe others have/would too.

Thanks, again.

My mistake it is not the function call fit$draws() that appears to take a long time.

Something after it prints “Total execution time: … seconds.” and before my script calls fit$draws() is taking up a lot of time. Haven’t figured out what it is yet. Still looking.

edit: attempting to be clear, my script looks like

print("fitting model")
fit <- mod$sample(..., num_chains = 400, num_cores = 10, ...)
print("reading draws")
draws <- fit$draws()

Yes, sample() currently reads back the resulting samples from the csv and stores them in the fit object.

Draws() then just returns them.

We plan on changing that for the beta release (make this optional), but currently we do it that way (in order to show the divergences and max treedepth warnings).