Cross-cluster contamination in a parallel cluster?

Hi all,

I am fitting different datasets to a Stan model (using cmdstanr) in parallel on my own PC (using parallel::makeCluster). After much debugging and trying different scenarios, it turns out that when I run the models in parallel there seems to be cross-cluster contamination, but only during the sampler phase that sets the initial values, step size, and inverse mass matrix. This gives the following error:

Error during model fitting: ‘init’ has the wrong length. See documentation of ‘init’ argument.

When I open 5 different instances of R/RStudio and run the models in parallel manually, without the clusters, the same error occurs. However, when I run the models sequentially by adding Sys.sleep() calls while still using the locally generated clusters, everything runs fine.
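
The working sequential variant looks roughly like the sketch below; the stagger length and the use of the loop index to offset each worker are my assumptions, not the exact code:

# Sketch of the workaround (assumed): each worker sleeps long enough that no two
# fits are in the adaptation phase at the same time.
foreach(run_id = 1:N_runs, .packages = "chkptstanr") %dopar% {
  Sys.sleep((run_id - 1) * 120)  # stagger start times; the delay length is a guess
  run_analysis(model = model_id, dataset = dataset_id, run = run_id)
}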

Because the code itself is quite long (and a reprex would be a lot of work), I'll first share some code snippets and potentially relevant information:
  • The models are pre-compiled.
  • I am using chkptstanr (version 0.2.0) to call cmdstanr; based on my testing I don't think the package itself is the issue, though perhaps there is some interaction.
  • All output paths are unique, and even the base chain names (output_basename) are unique (see the sketch below).
  • Each run uses a different seed.
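
To illustrate what I mean by unique paths and seeds, the per-job values are derived roughly like this (the names path_ and seed match the chkpt_stan() call further down; seed_table is a hypothetical lookup of mine, not the actual code):

# Every (model, dataset, run) combination gets its own checkpoint folder and seed.
path_ <- file.path("checkpoints", paste(model, dataset, run, sep = "_"))
seed  <- seed_table[[paste(model, dataset, run, sep = ".")]]  # hypothetical lookup table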

Does anyone have any ideas here?

# Packages
library(glue)
library(doSNOW)
library(foreach)
library(chkptstanr)
library(MASS)
library(cmdstanr)
library(dplyr)
library(stringr)
library(brms)

chkpt_stan(model_code = stan_model,
                   data = stan_data,
                   iter_adaptation = 150,
                   iter_warmup = warmup_iters,
                   iter_sampling = sampling_iters,
                   iter_per_chkpt = chkpt_iters,
                   parallel_chains = 4,
                   threads_per = 1,
                   chkpt_progress = TRUE,
                   control = NULL,
                   seed = seed,
                   stop_after = dynamic_stop,
                   reset = FALSE,
                   path = path_,
                   output_basename = paste0("chain_", model, ".", 
                                            dataset, ".",
                                            run, "."))

## set up parallel backend ---------------------------------------------------

if (hyper_parallel) {
  # one worker per model run in parallel; all workers write console output to the same file
  cluster = parallel::makeCluster(
    models_in_parallel,
    outfile = glue("output/consoleOut.txt")
  )
  doSNOW::registerDoSNOW(cluster)
}

# Nested approach

foreach(model_id = models, 
        .packages = c('chkptstanr', 'MASS', 'cmdstanr', 'dplyr', 'stringr', 'brms'), 
        .errorhandling = "stop") %:%
  foreach(dataset_id = 1:N_datasets, 
          .packages = c('chkptstanr', 'MASS', 'cmdstanr', 'dplyr', 'stringr', 'brms'), 
          .errorhandling = "stop") %:%
  foreach(run_id = 1:N_runs,
          .packages = c('chkptstanr', 'MASS', 'cmdstanr', 'dplyr', 'stringr', 'brms'),
          .errorhandling = "stop") %dopar% {
            
            run_analysis(model = model_id, dataset = dataset_id, run = run_id)
            
          } # foreach close
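
Not shown in the snippet: once the nested loop has finished, the cluster would normally be shut down (my addition here, for completeness):

# stop the workers after all jobs are done
if (hyper_parallel) parallel::stopCluster(cluster)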


Additional information:

  • Operating System: Windows 11
  • CmdStan version: 2.35.0
  • CmdStanR version: 0.8.1.9000

Interesting finding. I wonder if some cout calls in the source would help us figure out what goes into the init.


Would you be so kind as to clarify a bit more?
By "the source", do you mean the C++ file generated when the model is compiled?
And do you have any suggestions about what to cout?
I have never done this :)

Oh yes, this is probably something the devs need to test. Adding cout or logging calls to the source is not straightforward.