Issue using rstan with BiocParallel and MulticoreParam back-end

Hi there,

As part of a proteomics research project, I'd like to run the same model on a large number of different datasets (1000+) in parallel across several cores (6 in my case). For this I am trying to use BiocParallel with its MulticoreParam back-end (on Linux). Note that I don't use the parallelization feature built into Stan (cores = 1).
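For readers following along, a setup like the one described above might look roughly like the sketch below. The names `datasets` (a list of Stan data lists), `fit_one`, `run_batch`, and the choice of 4 chains are assumptions for illustration, not taken from the original post.

```r
# Sketch only: `datasets` and `model` are hypothetical inputs.
# `model` is assumed to be a compiled rstan stanmodel object.
fit_one <- function(stan_data, model) {
  # cores = 1: no within-model parallelism, chains run one after another
  rstan::sampling(model, data = stan_data, chains = 4, cores = 1, refresh = 0)
}

run_batch <- function(datasets, model, workers = 6) {
  # MulticoreParam forks the current R session (not available on Windows)
  param <- BiocParallel::MulticoreParam(workers = workers)
  # Extra arguments (here `model`) are passed on to fit_one for every element
  BiocParallel::bplapply(datasets, fit_one, model = model, BPPARAM = param)
}
```

Calling `run_batch(datasets, stan_mod)` would then return one fit per dataset.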

This works fine for a fairly reasonable number of models (100), but when I increase the number of models further while keeping the same number of cores, I systematically get an error message:

Error: BiocParallel errors
element index: 271 (or other element index depending on run)
unable to load shared object ‘tmp/Rtmp7TxrWl/file249d104be97c44.so’
tmp/Rtmp7TxrWl/file249d104be97c44.so : file too short

and as soon as this happens, all the remaining jobs fail with the same type of error message.

I tried to run the batch in serial mode (SerialParam in BiocParallel) and this works fine, so it is unlikely to be due to the data specifics of one model in the series.
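One way to make the serial versus multicore comparison more systematic is to capture the errors instead of aborting, and to record which elements failed. The sketch below uses `bptry()` and `bpok()` from BiocParallel; the wrapper name `run_and_report` and the inputs `datasets` and `fit_fun` are hypothetical.

```r
run_and_report <- function(items, fun, param) {
  # bptry() returns the (partial) result list instead of raising an error
  res <- BiocParallel::bptry(
    BiocParallel::bplapply(items, fun, BPPARAM = param)
  )
  ok <- BiocParallel::bpok(res)  # logical vector, FALSE where an element failed
  list(results = res, failed = which(!ok))
}

# Hypothetical usage, with `datasets` and `fit_fun` defined elsewhere.
# stop.on.error = FALSE lets the remaining tasks run after a first failure,
# so `failed` lists only the elements that actually raised an error.
# serial   <- run_and_report(datasets, fit_fun,
#                            BiocParallel::SerialParam(stop.on.error = FALSE))
# parallel <- run_and_report(datasets, fit_fun,
#                            BiocParallel::MulticoreParam(workers = 2,
#                                                         stop.on.error = FALSE))
```

Comparing `serial$failed` with `parallel$failed` shows whether failures follow particular datasets or only the back-end.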

Since I suspected it might be related to a resource shortage issue (e.g. memory), I also tried to decrease the number of cores used in order to limit the number of jobs run simultaneously, but even with 2 cores the issue appears. I also tried to decrease the number of chain iterations to a very low number but again the issue is still there.

Has anyone experienced the same kind of issue in the past and found a solution?

I will also raise an issue on GitHub, for both Stan and BiocParallel.

Thanks a lot,

Philippe

Are you recompiling the model for each run?

@wds15: no, I am using the same pre-compiled .rds object, otherwise the compilation time would be prohibitive (my model is fairly complex). Do you think my issue might be related to concurrent access to this .rds file?
What is striking, though, is that the error always happens after roughly the same number of completed tasks, i.e. between the 265th and the 280th task, even if I submit the tasks in a different order!

Can you sketch the order of things happening?

I finally created a simpler case that allowed me to reproduce the error on a smaller scale. While playing with it, I noticed that when I first removed the precompiled model (.rds file) from disk and let Stan recompile the model before launching the tasks, the compiled model could be shared with the different tasks without any error occurring. When I read the precompiled model from disk instead, the error described above systematically happened.

I think the mistake probably lies in the following piece of code :

library(rstan)  # provides stanc() and stan_model()

# 0. check that modelScript.stan exists
stanScriptFile <- paste0(modelScript, ".stan")
if (!file.exists(stanScriptFile))
  stop(paste0(stanScriptFile, " does not exist!"))

# 1. check if modelScript.rds exists.
# 2. if not, compile it. Then save it as rds.
# 3. if modelScript.rds exists, make sure it is more recent
#    than modelScript.stan.
# 4. if more recent, load it, otherwise execute step 2
stanModelFile <- paste0(modelScript, ".rds")
compile <- TRUE
if (file.exists(stanModelFile)) {
  fileTimes <- file.mtime(c(stanScriptFile, stanModelFile))
  if (fileTimes[2] > fileTimes[1])
    compile <- FALSE
}

if (compile) {
  cat(paste0("Compiling Stan script : ", stanScriptFile, "\n"))
  stanc_ret <- stanc(file = stanScriptFile, verbose = TRUE)

  stan_mod <- stan_model(stanc_ret = stanc_ret,
                         verbose = TRUE,
                         auto_write = TRUE)
  cat("Model compilation successful! Writing model on disk...\n")
  saveRDS(object = stan_mod, file = stanModelFile)
  cat("Done!\n")
} else {
  cat(paste0("Found an updated Stan model : ", stanModelFile, "\n"))
  cat("Uploading...")
  stan_mod <- readRDS(file = stanModelFile)
  cat("Done!\n")
}
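The freshness check in steps 1 to 4 can be pulled out into a small base-R helper, so that it can be exercised without compiling anything. This is only a sketch; the function name `needs_compile` and its arguments are made up for illustration.

```r
# Returns TRUE when the Stan script has to be (re)compiled, i.e. when no .rds
# exists yet or the .rds is not strictly newer than the .stan source.
needs_compile <- function(stan_file, rds_file) {
  if (!file.exists(stan_file))
    stop(stan_file, " does not exist!")
  if (!file.exists(rds_file))
    return(TRUE)
  times <- file.mtime(c(stan_file, rds_file))
  !(times[2] > times[1])
}
```

With this helper, the branching in the snippet above would reduce to `if (needs_compile(stanScriptFile, stanModelFile)) ... else ...`.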

This all looks fine to me. The only thing that can go wrong is that, in a given R process, the above should only ever happen once. I mean, if in a given R process you have a "fit" function doing all the steps above, then there should only be one call to that fit function. Reloading the same model multiple times and then doing several successive fits in the same R process with the reloaded model can cause trouble.
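One way to guard against reloading is to wrap the `readRDS()` call in a closure that caches its result, so that repeated calls within the same R process reuse the object already in memory. This is a minimal sketch using base R only; `make_model_loader` and `get_model` are hypothetical names, and whether this avoids the error reported in this thread has not been verified.

```r
make_model_loader <- function(rds_file) {
  cached <- NULL
  function() {
    if (is.null(cached)) {
      # Read from disk on the first call only; later calls reuse the object
      cached <<- readRDS(rds_file)
    }
    cached
  }
}

# Hypothetical usage:
# get_model <- make_model_loader("modelScript.rds")
# stan_mod  <- get_model()   # reads the file
# stan_mod  <- get_model()   # returns the cached object, no second read
```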

I would also suggest making sure that one run is executed first, before the mass batch submission to the cluster, so that the compilation is handled up front.

If all that does not help, then maybe consider moving to cmdstanr. With it, the issues you are seeing should almost certainly not occur.
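A rough sketch of what the cmdstanr route could look like, assuming cmdstanr and CmdStan are installed. The wrapper name `run_batch_cmdstanr`, the `datasets` list, and the choice of 4 chains are assumptions for illustration. With cmdstanr, each fit runs an external CmdStan executable rather than loading a compiled shared object into the R session.

```r
run_batch_cmdstanr <- function(stan_file, datasets, workers = 6) {
  # Compiles the model to a standalone executable before any task starts
  mod <- cmdstanr::cmdstan_model(stan_file)

  fit_one <- function(stan_data) {
    fit <- mod$sample(data = stan_data, chains = 4,
                      parallel_chains = 1, refresh = 0)
    # Read the draws into memory before returning from the worker
    fit$draws()
  }

  BiocParallel::bplapply(
    datasets, fit_one,
    BPPARAM = BiocParallel::MulticoreParam(workers = workers)
  )
}
```

Returning the draws rather than the fit object is a design choice in this sketch: the fit object refers to CSV output files on disk, while the draws are self-contained in memory.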

Thanks. Indeed, I load the model only once (or compile it if no precompiled model object is present on disk) and only then trigger the whole calculation.
I'll keep the issue I have opened on the rstan GitHub repo open for the time being, in the hope that someone can track the problem down, since I now have a simple case that always reproduces the error (at least in my environment).