Hello,
I’m running brms with the cmdstanr backend on a research computing cluster, and the models take a while to run (~25 days). For some of these models, execution halts as soon as sampling finishes because the output CSV file cannot be found.
For example:
Compiling Stan program...
Start sampling
Running MCMC with 1 chain, with 24 thread(s) per chain...
Chain 1 Iteration: 1 / 5000 [ 0%] (Warmup)
Chain 1 Iteration: 100 / 5000 [ 2%] (Warmup)
Chain 1 Iteration: 200 / 5000 [ 4%] (Warmup)
Chain 1 Iteration: 300 / 5000 [ 6%] (Warmup)
Chain 1 Iteration: 400 / 5000 [ 8%] (Warmup)
Chain 1 Iteration: 500 / 5000 [ 10%] (Warmup)
Chain 1 Iteration: 600 / 5000 [ 12%] (Warmup)
Chain 1 Iteration: 700 / 5000 [ 14%] (Warmup)
Chain 1 Iteration: 800 / 5000 [ 16%] (Warmup)
Chain 1 Iteration: 900 / 5000 [ 18%] (Warmup)
Chain 1 Iteration: 1000 / 5000 [ 20%] (Warmup)
Chain 1 Iteration: 1100 / 5000 [ 22%] (Warmup)
Chain 1 Iteration: 1200 / 5000 [ 24%] (Warmup)
Chain 1 Iteration: 1300 / 5000 [ 26%] (Warmup)
Chain 1 Iteration: 1400 / 5000 [ 28%] (Warmup)
Chain 1 Iteration: 1500 / 5000 [ 30%] (Warmup)
Chain 1 Iteration: 1600 / 5000 [ 32%] (Warmup)
Chain 1 Iteration: 1700 / 5000 [ 34%] (Warmup)
Chain 1 Iteration: 1800 / 5000 [ 36%] (Warmup)
Chain 1 Iteration: 1900 / 5000 [ 38%] (Warmup)
Chain 1 Iteration: 2000 / 5000 [ 40%] (Warmup)
Chain 1 Iteration: 2100 / 5000 [ 42%] (Warmup)
Chain 1 Iteration: 2200 / 5000 [ 44%] (Warmup)
Chain 1 Iteration: 2300 / 5000 [ 46%] (Warmup)
Chain 1 Iteration: 2400 / 5000 [ 48%] (Warmup)
Chain 1 Iteration: 2500 / 5000 [ 50%] (Warmup)
Chain 1 Iteration: 2501 / 5000 [ 50%] (Sampling)
Chain 1 Iteration: 2600 / 5000 [ 52%] (Sampling)
Chain 1 Iteration: 2700 / 5000 [ 54%] (Sampling)
Chain 1 Iteration: 2800 / 5000 [ 56%] (Sampling)
Chain 1 Iteration: 2900 / 5000 [ 58%] (Sampling)
Chain 1 Iteration: 3000 / 5000 [ 60%] (Sampling)
Chain 1 Iteration: 3100 / 5000 [ 62%] (Sampling)
Chain 1 Iteration: 3200 / 5000 [ 64%] (Sampling)
Chain 1 Iteration: 3300 / 5000 [ 66%] (Sampling)
Chain 1 Iteration: 3400 / 5000 [ 68%] (Sampling)
Chain 1 Iteration: 3500 / 5000 [ 70%] (Sampling)
Chain 1 Iteration: 3600 / 5000 [ 72%] (Sampling)
Chain 1 Iteration: 3700 / 5000 [ 74%] (Sampling)
Chain 1 Iteration: 3800 / 5000 [ 76%] (Sampling)
Chain 1 Iteration: 3900 / 5000 [ 78%] (Sampling)
Chain 1 Iteration: 4000 / 5000 [ 80%] (Sampling)
Chain 1 Iteration: 4100 / 5000 [ 82%] (Sampling)
Chain 1 Iteration: 4200 / 5000 [ 84%] (Sampling)
Chain 1 Iteration: 4300 / 5000 [ 86%] (Sampling)
Chain 1 Iteration: 4400 / 5000 [ 88%] (Sampling)
Chain 1 Iteration: 4500 / 5000 [ 90%] (Sampling)
Chain 1 Iteration: 4600 / 5000 [ 92%] (Sampling)
Chain 1 Iteration: 4700 / 5000 [ 94%] (Sampling)
Chain 1 Iteration: 4800 / 5000 [ 96%] (Sampling)
Chain 1 Iteration: 4900 / 5000 [ 98%] (Sampling)
Chain 1 Iteration: 5000 / 5000 [100%] (Sampling)
Chain 1 finished in 2204150.0 seconds.
Error in read_cmdstan_csv(self$output_files(), variables = "", sampler_diagnostics = if (!fixed_param) c("treedepth__", :
Assertion on 'files' failed: File does not exist: '/tmp/RtmpykA2YK/file1c9ba304aee56_threads-202202211255-1-49bed2.csv'.
Calls: brm ... read_cmdstan_csv -> <Anonymous> -> makeAssertion -> mstop
Execution halted
For some models, everything proceeds normally, and a file is saved after sampling finishes. For other models (roughly 2 out of 3), I get an error message similar to the one above. I have not been able to find a pattern that explains why some models save normally and some do not.
I’d appreciate any advice/thoughts! When I run short models or models with few iterations they save just fine, and I have no issues.
This is just a guess: I’m wondering if the tmp directory is getting deleted/overwritten for some reason? Is there a way I can force R/brms/cmdstanr to use a permanent location over a tmp directory?
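To make the question concrete, here is a rough sketch of one workaround I am considering. I have not confirmed that it fixes the problem. R picks its session temp directory from the `TMPDIR` environment variable (then `TMP`, then `TEMP`) when it starts, so exporting `TMPDIR` before launching the job should move `tempdir()` away from `/tmp`. The `SCRATCH_DIR` variable and the `stan_tmp` directory name are placeholders I made up for illustration, not real paths on my cluster.

```shell
# Hypothetical persistent location; replace with a directory that the
# cluster does not clean up automatically.
SCRATCH_DIR="${SCRATCH_DIR:-${HOME:-$PWD}/stan_tmp}"
mkdir -p "$SCRATCH_DIR"

# R reads TMPDIR once at startup, so it has to be exported before
# R (or Rscript) is launched, e.g. at the top of the job script.
export TMPDIR="$SCRATCH_DIR"

# If R is available, print the temp directory a fresh session would use.
# It should be a subdirectory of $SCRATCH_DIR rather than of /tmp.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'cat(tempdir(), "\n")'
fi
```

If this works as I expect, the CSV files would end up under `$SCRATCH_DIR/Rtmp...` instead of `/tmp/Rtmp...`. R still removes its session temp directory when the session exits, so this would only help if something outside R is cleaning `/tmp` during the run.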
- Operating System: Red Hat Enterprise Linux Server 7.4 (Maipo)
- brms Version: 2.14.4
- cmdstanr Version: 0.3.0
- cmdstan Version: 2.26.1
Thank you!
Peter