CmdStanR reports error "All variables in all chains must have the same length" after apparently successful sampling

I apologize for not having a reproducible example. I’ll sketch the situation and ask for any pointers to help me sort out what’s going on. I don’t expect anyone to have a solution, nice as that would be.

To cut to the chase: where does that error originate? I grepped through the Stan source code, CmdStanR, and the generated executable, and couldn’t find that error message or any fragment of it. What specific thing is being checked and found incorrect? And, last but not least, is there any advice on how to avoid the error?

(EDIT: The message apparently comes from as_draws_list.R in the stan-dev/posterior repository on GitHub. I still have no idea what went wrong. The code that produces the error is pretty simple, just checking that all chains are the same length as the first one, but how they came to have different lengths is still a mystery.)
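To make the kind of check described above concrete, here is a minimal base R sketch. It is a simplified paraphrase, not the actual posterior source: the function name all_same_length and the toy data are hypothetical, and each chain is assumed to be a named list of numeric vectors, one per variable.

```r
# Hypothetical stand-in for the length check; NOT the code in posterior.
# `chains` is assumed to be a list of chains, each a named list of
# numeric vectors (one vector of draws per variable).
all_same_length <- function(chains) {
  # Reference length: first variable of the first chain.
  first.len <- length(chains[[1]][[1]])
  # Every variable in every chain must match the reference length.
  all(vapply(chains,
             function(chain) all(lengths(chain) == first.len),
             logical(1)))
}

# Two chains with 5 draws each for every variable.
complete <- list(list(mu = rnorm(5), sigma = rnorm(5)),
                 list(mu = rnorm(5), sigma = rnorm(5)))

# Second chain is short, as if its output had been cut off.
truncated <- list(list(mu = rnorm(5), sigma = rnorm(5)),
                  list(mu = rnorm(3), sigma = rnorm(3)))

all_same_length(complete)
all_same_length(truncated)
```

The first call returns TRUE and the second FALSE; a chain whose output was cut short would trip a check of this kind.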

In more detail: I have an ODE model with about 15 equations and as many free parameters, and about 1000 data points. I have run this model, or close variations of it, dozens of times. This time I tweaked some of the parameters and ran it again, and bumped into this error at the end of sampling.

Note that the error message came after the sampler reported that all chains had concluded successfully. I have abridged the log output for clarity.

Chain 8 Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 8 finished in 1868680.0 seconds.
[...]
Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 1 finished in 1895910.0 seconds.
[...]
Chain 5 Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 5 finished in 1940930.0 seconds.
[...]
Chain 6 Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 6 finished in 2037400.0 seconds.
[...]
Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 3 finished in 2521350.0 seconds.
[...]
Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
Chain 2 finished in 2635690.0 seconds.
[...]
Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 4 finished in 2724520.0 seconds.
Chain 7 Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 7 finished in 2732470.0 seconds.

All 8 chains finished successfully.
Mean chain execution time: 2294618.8 seconds.
Total execution time: 2732505.2 seconds.
Error: All variables in all chains must have the same length.
Execution halted

Here is an excerpt showing how the sampler is launched. This much of the code hasn’t changed in a long time.

8 -> n.chains
500 -> n.post
250 -> n.burn
1 -> n.thin

(n.post + n.burn) * n.thin -> n.iter
n.burn * n.thin -> n.burnin

my.model$sample (data = foo.data,
                 iter_warmup = n.burnin,
                 iter_sampling = n.iter,
                 thin = n.thin,
                 seed = 123,
                 chains = n.chains,
                 parallel_chains = n.chains,
                 refresh = 10,
                 adapt_delta = 0.9,
                 max_treedepth = 10,
                 step_size = 0.1) -> cmdstan.object

make.save.sample.filename <- function () {
    paste0 ("foo-output-", git.version.string, ".RDS")
}

make.save.sample.filename () -> save.sample.filename

I am launching R via R CMD BATCH control_foo.R where control_foo.R contains the above code. It appears from the control_foo.Rout output that the error message is coming from my.model$sample since the following function definition and file name assignment aren’t echoed (with +) in control_foo.Rout.

Thanks for any insights anyone can offer.

Robert Dodier

I now remember that at some point during the sampling, the file system ran out of space on the partition containing the CmdStan output files (somewhere under /tmp). So I surmise that some samples weren’t written to the output file, and therefore at the end when the summary was to be made, the lists of samples were different lengths.

It’s kind of a bummer that Stan kept going after a write failed, when that would necessarily lead to a failure at the end … I would rather find out sooner rather than later. Also kind of a bummer that the output files were nuked at program termination; it would be pretty helpful to be able to inspect whatever was left. I don’t know what could be concluded from incomplete output files, but at any rate it would be better than having them simply vanish.
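Had the output files survived, one rough way to inspect them would be to count the draw rows in each chain's CSV file. The sketch below assumes the usual Stan CSV layout (comment lines starting with #, one header line, then one row per saved draw); the helper name count_draw_rows is hypothetical, and a row cut off mid-line would still be counted.

```r
# Hypothetical helper: count the draw rows in one Stan CSV file.
# Assumes comment lines start with "#" and that the first remaining
# line is the column header.
count_draw_rows <- function(csv.file) {
  lines <- readLines(csv.file, warn = FALSE)
  body <- lines[!grepl("^#", lines) & nzchar(lines)]
  max(length(body) - 1L, 0L)  # subtract the header line
}

# Toy demonstration with two fake files, the second one cut short.
full.file  <- tempfile(fileext = ".csv")
short.file <- tempfile(fileext = ".csv")
writeLines(c("# comment", "lp__,mu", "-1.0,0.1", "-1.2,0.2", "-0.9,0.3"),
           full.file)
writeLines(c("# comment", "lp__,mu", "-1.0,0.1"), short.file)

row.counts <- vapply(c(full.file, short.file), count_draw_rows, integer(1))
unname(row.counts)
```

Chains with fewer rows than the others would be the ones whose writes failed.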

This sounds like the most likely explanation. I’ve seen similar errors when threaded code leads to competing programs overwriting each other’s output files, but that does not seem to be the case here.

Obviously it’s too late to go back in time for this specific run, but for future reference you can pass output_dir to sample(), which overrides the default behavior of saving the output files to a temporary directory.
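As a sketch of that suggestion: the wrapper below creates the directory if needed and forwards it as output_dir. The name sample_to_dir and the directory "foo-stan-output" are hypothetical; my.model and foo.data are the objects from the original post, and the directory is assumed to be on a partition with enough free space.

```r
# Hypothetical wrapper: make sure the output directory exists, then
# pass it to the model's $sample() method as output_dir.
# Any further arguments (chains, seed, ...) are forwarded unchanged.
sample_to_dir <- function(model, data, out.dir, ...) {
  dir.create(out.dir, recursive = TRUE, showWarnings = FALSE)
  model$sample(data = data, output_dir = out.dir, ...)
}

# Usage with the objects from the original post (requires cmdstanr):
# sample_to_dir(my.model, foo.data, "foo-stan-output",
#               chains = 8, parallel_chains = 8) -> cmdstan.object
# cmdstan.object$output_files()  # paths of the per-chain CSV files
```

With the files in a directory of your choosing, they remain available for inspection after the R session ends.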