Issue using rstan with BiocParallel and MulticoreParam back-end

phauchamps · January 23, 2021, 7:32am

Hi there,

In the context of a proteomics research project, I’d like to run the same model on a large number of datasets (1000+) in parallel on several cores (6 in my case). For this I am trying to use BiocParallel with its MulticoreParam back-end (on Linux). Note that I don’t use the parallelization feature built into Stan (cores = 1).
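For readers who want to reproduce the setup, here is a minimal sketch of what is described above. The names `fit_all`, `stan_mod` and `datasets` are illustrative and do not come from the original code, and the chain and iteration counts are placeholders.

```r
# Fit one pre-compiled Stan model to many datasets, one dataset per task.
# Assumes the rstan and BiocParallel packages are installed; `stan_mod` is a
# compiled stanmodel object and `datasets` is a list of Stan data lists
# (both names are hypothetical).
fit_all <- function(stan_mod, datasets, workers = 6) {
  # MulticoreParam uses forked worker processes (not available on Windows)
  param <- BiocParallel::MulticoreParam(workers = workers)
  BiocParallel::bplapply(
    datasets,
    function(d, model) {
      # cores = 1: the chains run sequentially inside each worker
      rstan::sampling(model, data = d, chains = 4, iter = 2000, cores = 1)
    },
    model = stan_mod,
    BPPARAM = param
  )
}
```

The function only defines the workflow; calling it requires a compiled model and real data.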

This works fine for a moderate number of models (100), but when I increase the number of models further while keeping the same number of cores, I systematically get an error message:

Error: BiocParallel errors
element index: 271 (or other element index depending on run)
unable to load shared object ‘tmp/Rtmp7TxrWl/file249d104be97c44.so’
tmp/Rtmp7TxrWl/file249d104be97c44.so : file too short

and as soon as this happens the rest of the jobs all fail with the same
type of error message.

I tried to run the batch in serial mode (SerialParam in BiocParallel) and this works fine, so it is unlikely to be due to the data specifics of one model in the series.
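A sketch of how the serial and multicore runs could be switched with a single flag, while recording which tasks failed instead of aborting. The wrapper `run_tasks` and its arguments are hypothetical; `bptry()`, `bpok()` and the `stop.on.error` argument are part of BiocParallel.

```r
# Run the same tasks either serially or on forked workers and report failures.
# Assumes the BiocParallel package is installed; `tasks` and `fun` are
# placeholders for the datasets and the per-dataset fitting function.
run_tasks <- function(tasks, fun, parallel = TRUE, workers = 6) {
  param <- if (parallel) {
    BiocParallel::MulticoreParam(workers = workers, stop.on.error = FALSE)
  } else {
    BiocParallel::SerialParam(stop.on.error = FALSE)
  }
  # bptry() returns the partial results instead of raising an error
  res <- BiocParallel::bptry(BiocParallel::bplapply(tasks, fun, BPPARAM = param))
  ok <- BiocParallel::bpok(res)
  list(results = res, failed = which(!ok))
}
```

Comparing the `failed` indices between the two modes shows whether the failures follow the data or the back-end.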

Since I suspected it might be related to a resource shortage issue (e.g. memory), I also tried to decrease the number of cores used in order to limit the number of jobs run simultaneously, but even with 2 cores the issue appears. I also tried to decrease the number of chain iterations to a very low number but again the issue is still there.

Has anyone experienced the same kind of issue in the past and found a solution?

I will also raise an issue on GitHub, both for Stan and for BiocParallel.

Thanks a lot,

Philippe

wds15 · January 23, 2021, 8:12am

Are you recompiling the model for each run?

phauchamps · January 23, 2021, 8:59am

@wds15: no, I am using the same pre-compiled .rds object, otherwise the compilation time would be prohibitive (my model is fairly complex). Do you think my issue might be related to concurrent access to this .rds file?
What is striking, though, is that the error always happens after roughly the same number of completed tasks, i.e. between the 265th and the 280th task, even if I select the tasks in a different order!
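One fact that may be relevant to the concurrent-access question: with a fork-based back-end, every worker inherits the temporary directory of the parent R session, so files in that directory are visible to all workers. The sketch below only demonstrates this inheritance with base R's `parallel::mclapply`; it is not a diagnosis of the error above.

```r
# Forked workers inherit tempdir() from the parent R session (Unix-alikes only).
parent_dir <- tempdir()
worker_dirs <- parallel::mclapply(1:2, function(i) tempdir(), mc.cores = 2)

# TRUE when every worker reports the parent's temporary directory
shared <- all(unlist(worker_dirs) == parent_dir)
print(shared)
```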

wds15 · January 23, 2021, 10:19am

Can you sketch the order of things happening?

phauchamps · January 24, 2021, 4:51pm

I finally created a simpler case that allowed me to reproduce the error on a smaller scale. While playing with it, I noticed that when I first removed the precompiled model (.rds file) from disk and let Stan recompile the model before launching the tasks, the compiled model could be shared with the different tasks without any error occurring. When I read the precompiled model from disk instead, the error described above systematically happened.

I think the mistake probably lies in the following piece of code:

library(rstan)

# 0. check that `modelScript.stan` exists
stanScriptFile <- paste0(modelScript, ".stan")
if (!file.exists(stanScriptFile))
  stop(paste0(stanScriptFile, " does not exist!"))

# 1. check if `modelScript.rds` exists.
# 2. if not, compile it. Then save it as rds.
# 3. if `modelScript.rds` exists, make sure it is more recent
#    than `modelScript.stan`.
# 4. if more recent, load it, otherwise execute step 2
stanModelFile <- paste0(modelScript, ".rds")
compile <- TRUE
if (file.exists(stanModelFile)) {
  fileTimes <- file.mtime(c(stanScriptFile, stanModelFile))
  if (fileTimes[2] > fileTimes[1])
    compile <- FALSE
}

if (compile) {
  cat(paste0("Compiling Stan script : ", stanScriptFile, "\n"))
  stanc_ret <- stanc(file = stanScriptFile, verbose = TRUE)

  stan_mod <- stan_model(stanc_ret = stanc_ret,
                         verbose = TRUE,
                         auto_write = TRUE)
  cat("Model compilation successful! Writing model on disk...\n")
  saveRDS(object = stan_mod, file = stanModelFile)
  cat("Done!\n")
} else {
  cat(paste0("Found an updated Stan model : ", stanModelFile, "\n"))
  cat("Loading...")
  stan_mod <- readRDS(file = stanModelFile)
  cat("Done!\n")
}
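The timestamp check in steps 1 to 4 above can be isolated into a small helper that needs only base R, which makes it easy to test separately from the compilation itself. The name `needs_recompile` and the throwaway file names are illustrative, not part of the original code.

```r
# Returns TRUE when the cached .rds is missing or not newer than the .stan file.
needs_recompile <- function(stan_file, rds_file) {
  if (!file.exists(stan_file)) stop(stan_file, " does not exist!")
  if (!file.exists(rds_file)) return(TRUE)
  times <- file.mtime(c(stan_file, rds_file))
  !(times[2] > times[1])
}

# Usage with throwaway files in a fresh temporary folder
demo_dir <- tempfile("demo")
dir.create(demo_dir)
stan_file <- file.path(demo_dir, "model.stan")
rds_file <- file.path(demo_dir, "model.rds")
writeLines("parameters { real y; } model { y ~ normal(0, 1); }", stan_file)

first <- needs_recompile(stan_file, rds_file)           # no cache yet
saveRDS(list(), rds_file)
Sys.setFileTime(rds_file, file.mtime(stan_file) + 60)   # cache newer than script
second <- needs_recompile(stan_file, rds_file)
Sys.setFileTime(rds_file, file.mtime(stan_file) - 60)   # cache older than script
third <- needs_recompile(stan_file, rds_file)
```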

wds15 · January 25, 2021, 5:33am

This all looks fine to me. The only thing that can go wrong is that, in a given R process, the above should only ever happen once. I mean, if in a given R process you have a "fit" function doing all the steps above, then there should only be one call to that function. Loading the same model multiple times and then running multiple fits successively in the same R process with a reloaded model can cause trouble.
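One way to guarantee a single load per R process is a small per-process cache, sketched here with base R only. The names `.model_cache`, `get_model` and `counting_loader` are hypothetical, and a plain list stands in for the compiled model.

```r
# Per-process cache: the loader runs at most once per file in a given R session.
.model_cache <- new.env(parent = emptyenv())

get_model <- function(rds_file, loader = readRDS) {
  if (!exists(rds_file, envir = .model_cache, inherits = FALSE)) {
    assign(rds_file, loader(rds_file), envir = .model_cache)
  }
  get(rds_file, envir = .model_cache, inherits = FALSE)
}

# Usage: count how often the file is actually read from disk
n_reads <- 0
counting_loader <- function(path) {
  n_reads <<- n_reads + 1
  readRDS(path)
}
tmp <- tempfile(fileext = ".rds")
saveRDS(list(name = "demo"), tmp)   # stand-in for a saved stanmodel
m1 <- get_model(tmp, counting_loader)
m2 <- get_model(tmp, counting_loader)
```

The second call returns the cached object without touching the file again.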

I would also suggest making sure that a first run is completed before the mass batch submission to the cluster, so that the compilation is handled first.
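A sketch of that ordering: run the first task in the parent process, then hand the rest to the parallel back-end. `run_batch`, `fit_one` and `apply_fun` are hypothetical names; the default `lapply` keeps the example runnable without any extra package.

```r
# Run the first task up front, then dispatch the remaining ones.
# To use BiocParallel, pass for example
#   apply_fun = function(X, FUN) BiocParallel::bplapply(
#     X, FUN, BPPARAM = BiocParallel::MulticoreParam(workers = 6))
run_batch <- function(datasets, fit_one, apply_fun = lapply) {
  stopifnot(length(datasets) >= 1)
  first <- fit_one(datasets[[1]])            # warm-up in the parent process
  rest <- apply_fun(datasets[-1], fit_one)   # remaining tasks
  c(list(first), rest)
}

# Usage with a stand-in for the real fitting function
results <- run_batch(list(1, 2, 3), function(d) d * 10)
```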

If all that does not help, then maybe consider moving to cmdstanr. With it, the issues you are seeing should almost certainly not occur.
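For orientation, a minimal cmdstanr equivalent might look like the sketch below. It assumes the cmdstanr package and a working CmdStan installation; `fit_with_cmdstanr` and its arguments are illustrative names. cmdstanr runs the model as an external executable rather than loading a shared object into the R session.

```r
# Compile (or reuse) the model executable and draw samples for one dataset.
# Assumes cmdstanr and CmdStan are installed; the chain count is a placeholder.
fit_with_cmdstanr <- function(stan_file, data_list) {
  # cmdstan_model() compiles to an executable; an up-to-date one is reused
  mod <- cmdstanr::cmdstan_model(stan_file)
  # parallel_chains = 1: chains run one after another, as with cores = 1 in rstan
  mod$sample(data = data_list, chains = 4, parallel_chains = 1, refresh = 0)
}
```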

phauchamps · January 25, 2021, 7:46am

Thanks. Indeed, I load the model only once (or compile it if the precompiled model object is not present on disk) and only then trigger the whole calculation.
I’ll keep the issue I have opened on the rstan GitHub repo for the time being, in the hope that someone can track it down, since I now have a simple case that always shows the error (at least in my environment).
