Parallelizing the same model fit for different data

Hi Stan community,

I am trying to fit the same model (using cmdstanr) to different data inputs data_list_all[[idx]]. I thought I could use a parallel for loop to save some time. In the code below I used foreach with %dopar% from library(parallel), library(foreach), and library(doParallel). It works when I shrink the data in each data_list_all[[idx]] (for testing), but with the full-length data the fits for a few data_list_all[[idx]] were not saved properly, because the CmdStan output files were not found in the temp folder. If I just run sequentially with the full data size, every fit works fine.

Do you know what’s going on here? Is there a better approach? My gut feeling is that parallel chains in Stan may not be fully compatible with a foreach loop, such that a finished chain for data_list_all[[2]] gets overwritten by a chain for data_list_all[[6]] before the other chains for data_list_all[[2]] complete.

Another idea is to index the parameters in the model and feed in all the data sets with matching indexing; as long as I don’t pool parameters across data sets, it should be identical to the for-loop solution. Do you think in that case I can tell Stan to use 4 cores per data_list_all[[idx]]?

Thank you very much :)

  fit_list_all <- foreach(
    idx = fit_idx
  ) %dopar% {
    mod$sample(
      data = data_list_all[[idx]], iter_warmup = 1000, iter_sampling = 1000,
      chains = 4, parallel_chains = 4, show_messages = FALSE
    )
  }

You may be right, even though R generates the file names randomly. You could try the output_dir or output_basename arguments of the sample() method on a cmdstan_model in cmdstanr.
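For example, a minimal sketch of how the loop from the original post could write each fit's CSV files to its own directory. This assumes mod, fit_idx, and data_list_all from the code above; the directory layout and basename scheme are my own invention, and I have not verified this against a running CmdStan installation:

```r
library(foreach)

fit_list_all <- foreach(idx = fit_idx) %dopar% {
  # Give each dataset its own persistent output directory, so parallel
  # workers never collide on CSV files in the shared tempdir.
  out_dir <- file.path("cmdstan_output", paste0("data_", idx))
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

  mod$sample(
    data = data_list_all[[idx]],
    iter_warmup = 1000, iter_sampling = 1000,
    chains = 4, parallel_chains = 4,
    output_dir = out_dir,                     # keep CSVs out of tempdir
    output_basename = paste0("fit_", idx),    # unique name per dataset
    show_messages = FALSE
  )
}
```

Since the files then live outside the session tempdir, they also survive if a worker's R session exits before the fit object is read back.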

I’m pinging @jgabry, who should know the answer here.

Also, you want to be careful not to spawn more jobs than you have cores. Even then, I find that my rather beefy Xeon-based iMac Pro can’t run 8 chains in parallel nearly as fast as it runs 1 sequentially. So you might not get much gain from parallelization if you’re close to or exceeding the number of cores you have.


I will try specifying the CmdStan output folder. Hopefully @jonah has a better solution.

You are absolutely right; this has already happened to me. I do the following to avoid the issue.

  parallel::detectCores()  # check how many cores are available
  n.cores <- parallel::detectCores() - 2
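To show where n.cores fits in, here is a sketch of registering a doParallel backend before the foreach loop. The cluster setup and teardown are assumptions on my part (not from the original post), and the loop body is elided:

```r
library(parallel)
library(doParallel)
library(foreach)

# Leave a couple of cores free so the machine stays responsive,
# and so workers x parallel_chains does not oversubscribe the CPU.
n.cores <- parallel::detectCores() - 2

cl <- parallel::makeCluster(n.cores)
doParallel::registerDoParallel(cl)  # makes %dopar% use this cluster

# ... run the foreach(idx = fit_idx) %dopar% { ... } loop here ...

parallel::stopCluster(cl)  # release the workers when done
```

Note that each worker here additionally runs parallel_chains = 4 CmdStan processes, so the effective core demand is roughly n.cores x 4 unless parallel_chains is reduced inside the loop.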