Error with cmdstanr on cluster: 'iteration_ids' of bound objects do not match

When running the model with cmdstanr, I get the following message:

All 6 chains finished succesfully.
Mean chain execution time: 6387.6 seconds.
Total execution time: 7798.1 seconds.
Error: 'iteration_ids' of bound objects do not match.
Execution halted

I was able to run this model with fewer chains and fewer iterations (i.e. 4 chains with 1000 + 1000 iterations). Here I have 6 chains with 1000 + 2000 iterations. I'm not sure how to interpret this error, and I had no luck searching Google.

Any chance you can upload the produced .csv files? Or send them via Discourse DM. This error occurs in read_sample_csv and can easily be reproduced with the csv files. The error itself is produced by the posterior package, so it's most likely an error in what we call the posterior package functions with.
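For reference, here is a minimal sketch of how this kind of error can be triggered in posterior directly, without any Stan run. It assumes that `bind_draws(..., along = "chain")` requires the bound objects to have identical iteration ids, which is my reading of the error message. The draws come from posterior's built-in `example_draws()`, not from the model in this thread.

```r
library(posterior)

# Built-in example draws: 100 iterations, 4 chains.
full <- example_draws()

# Simulate chains that are shorter than expected by keeping only
# the first 50 iterations.
short <- subset_draws(full, iteration = 1:50)

# Binding along the chain dimension is expected to fail here, because the
# two objects no longer share the same iteration ids.
res <- try(bind_draws(full, short, along = "chain"), silent = TRUE)
inherits(res, "try-error")
```

If that matches what happens inside the CSV reader, then at least one chain's file produced a different set of iterations than the others.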

Any chance the chains have a different number of samples?

The chains have the same number of samples.

Where would the csv files be stored?

The output csv files are located in the /tmp folder. But if the R session stopped, they are most likely lost. You can normally obtain their paths via fit$output_files(), but given that sampling stopped before the fit object was returned, that won't work.

You can also see the output file paths in the console output:

Running MCMC with 1 chain(s) on 1 core(s)...

Running ./example 'id=1' random 'seed=123' output 'file=/tmp/RtmpTuFoha/example-202003221546-1-40bd82.csv' 'method=sample'
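To avoid losing the files on a cluster, one option is to have CmdStan write the CSVs to a persistent directory through the `output_dir` argument of `$sample()`. The sketch below assumes cmdstanr and CmdStan are installed; the toy model and the `stan_csv` directory name are placeholders, not the model from this thread.

```r
library(cmdstanr)

# Placeholder directory: point this at persistent storage on the cluster.
out_dir <- file.path(getwd(), "stan_csv")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy model without data, only used to make the example self-contained.
stan_file <- write_stan_file("
parameters { real y; }
model { y ~ normal(0, 1); }
")
mod <- cmdstan_model(stan_file)

fit <- mod$sample(
  chains = 2,
  iter_warmup = 200,
  iter_sampling = 200,
  output_dir = out_dir,  # CSVs go here instead of the R temp directory
  refresh = 0
)

# The CSV files stay in out_dir even if reading them back into R fails.
csv_files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
print(csv_files)
```

With the files kept this way, they can be inspected or shared after a failed job.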

I see.

I’ve indeed been using fit$output_files() to extract the results and then save an rstan fit object. Unfortunately, I don’t run an interactive R session on the cluster, so any temp files are indeed lost.

What usually causes this error in posterior?

Honestly, I haven't seen this one yet. The only thing I can think of is that a chain was somehow of a different length (maybe a corrupt csv due to lack of hard disk space).
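If the CSV files are still available, a quick way to test the "different length" idea is to count the draw rows in each file. This is a rough sketch in base R: `count_draws` is a hypothetical helper, and it assumes the usual Stan CSV layout (comment lines starting with `#`, one header line, then one row per saved draw). The two demo files are made up for illustration.

```r
# Count draw rows in a Stan CSV file: every line that is neither blank nor a
# '#' comment, minus the single column-header line.
count_draws <- function(path) {
  lines <- readLines(path, warn = FALSE)
  data_lines <- lines[nzchar(trimws(lines)) & !startsWith(lines, "#")]
  max(length(data_lines) - 1L, 0L)
}

# Two tiny made-up files: the second one is missing its last draw.
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("# method = sample", "lp__,y", "-0.1,0.3", "-0.2,0.5", "-0.4,0.9"), f1)
writeLines(c("# method = sample", "lp__,y", "-0.1,0.3", "-0.2,0.5"), f2)

n_draws <- vapply(c(f1, f2), count_draws, integer(1))
print(unname(n_draws))

# TRUE only when every chain saved the same number of draws.
same_length <- length(unique(n_draws)) == 1
print(same_length)
```

On real output, replace `c(f1, f2)` with the paths of the chain CSVs. A truncated file would show up as a lower count than the rest.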

My diagnosis for the time being is that I don’t get the issue when only running 4 chains. So it might be due to running 6 chains, and it could be a cluster issue. I’ll post more results as I get them.

Some updates, though no solution yet:

  • With one model, I have issues with 4 chains and 1000 + 3000 iterations, but not with 1000 + 2000 iterations. This corroborates the memory issue hypothesis.
  • That said, another model runs fine with 6 chains and 1000 + 2000 iterations.
  • I’ve also made sure I have enough local memory and that I’m not hitting the RAM limit on the job.

I ended up using two jobs to get my 12,000 samples, which is a good enough solution for now.
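For anyone using the same workaround, the draws from the two jobs can be combined afterwards with posterior. This is only a sketch: `job1` and `job2` are stand-ins built from `example_draws()`, whereas in practice each would be the draws object saved from one job. It assumes both jobs used the same number of post-warmup iterations per chain.

```r
library(posterior)

# Stand-ins for the draws of two separate jobs (100 iterations x 4 chains each).
job1 <- example_draws()
job2 <- example_draws()

# Treat the chains of the second job as additional chains of the same model.
combined <- bind_draws(job1, job2, along = "chain")

nchains(combined)      # chains from both jobs
niterations(combined)  # unchanged iterations per chain

# Convergence diagnostics can then be computed across all chains.
summarise_draws(combined)
```

Each job should use a different seed, otherwise the second job would simply repeat the first one's draws.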