Stan sampler gets stuck

Hello all,

I am fitting a model with 4 parallel chains in PyStan 2.19.0.0 using the PyCharm console. After two chains finish sampling, the entire process gets stuck and the two remaining chains never start. I am on a Linux CentOS 7 cluster and I allocate 50 GB to the process; each of the first two chains takes 20 GB.
Any advice would be appreciated.

It seems to me that the third chain has a large R-hat and does not converge.
In my experience, MCMC sampling takes a long time when a chain does not converge. If you can find a seed for which sampling with a single chain does not converge, then the problem is caused by the model, the data, etc.
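A minimal sketch of such a single-chain check in PyStan 2 (the toy model, data, and seed here are made up; substitute your own):

```python
import pystan

# Hypothetical toy model and data; substitute your own Stan program.
model_code = """
data { int<lower=0> N; vector[N] y; }
parameters { real mu; real<lower=0> sigma; }
model { y ~ normal(mu, sigma); }
"""
sm = pystan.StanModel(model_code=model_code)
stan_data = {"N": 5, "y": [0.1, -0.3, 0.8, 1.2, -0.5]}

# One chain with a fixed seed makes the run reproducible, so a
# non-converging seed can be isolated and re-run for debugging.
fit = sm.sampling(data=stan_data, chains=1, seed=12345, iter=2000)

# Check Rhat / n_eff in the summary for signs of non-convergence.
print(fit.stansummary())
```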

Thanks, although I am not sure this is the issue in my case. When I allocate enough memory (100 GB), all chains run at the same time. I am surprised that multiprocessing cannot move on from the first two chains to the last two when less memory is available. So maybe there is still a way around it.

Can you try to run the script from the command line?

I am still debugging, so I must run from the console. Overnight all four chains finished sampling, but the run failed with the error:

multiprocessing.pool.MaybeEncodingError: Error sending result: '[(0, <stanfit4anon_model_efc859a5164be00d3af9579329c2196d_4498398224597621611.PyStanHolder object at 0x2b03802f6eb0>)]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647")'

So frustrating.

Oh, that is true. Multiprocessing doesn’t work in that case. That is a pickling error.

You need to run the chains serially (n_jobs=1).

Alternatively, run your model with one chain from a script and save the output with ArviZ, then run that script n times. Once all runs have finished, combine the chains (arviz.concat), as in the sketch below.
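A minimal sketch of that workflow (the toy model, file names, and seed scheme are made up):

```python
# run_chain.py -- run as `python run_chain.py <chain_id>`, once per chain.
import sys
import pystan
import arviz as az

model_code = "parameters { real mu; } model { mu ~ normal(0, 1); }"
sm = pystan.StanModel(model_code=model_code)

chain_id = int(sys.argv[1])
fit = sm.sampling(chains=1, n_jobs=1, seed=1000 + chain_id)
az.from_pystan(posterior=fit).to_netcdf(f"chain_{chain_id}.nc")
```

and once all runs have finished:

```python
import arviz as az

# Combine the per-chain files along the chain dimension.
idatas = [az.from_netcdf(f"chain_{i}.nc") for i in range(4)]
combined = az.concat(idatas, dim="chain")
print(az.summary(combined))
```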

Yes, thank you! I just started a run with n_jobs=1. Is this because the model is too big?
Also, I read that this issue was solved in PyStan 3 - will it be released anytime soon?

Yes. It might also be fixed in Python 3.8, where pickle/multiprocessing uses the correct flags.

PyStan 3 was supposed to fix this, but it actually changed back to a process-based approach, and the minimum Python version is now 3.8, so… we need to test this.

I see, thanks. I hope n_jobs=1 will work. I am still not sure what the solution to my original issue is, though… (or is it a branch of the same problem?)

PyStan 2 uses pickle to move the final output draws around. There are some size limits on how much you can pickle at one time. I suspect you are exceeding these limits.
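For what it's worth, the exact error in that traceback can be reproduced without Stan: on Python < 3.8, the multiprocessing pipe packs the payload's byte length as a signed 32-bit int, so any pickled result over 2**31 - 1 bytes overflows the header. A minimal sketch:

```python
import struct

# multiprocessing (Python < 3.8) packs the result's length with "!i",
# a signed 32-bit int; lengths over 2**31 - 1 overflow it:
struct.pack("!i", 2**31)
# struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```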

I think the easiest way to solve this problem might be to thin your results so your final chain is smaller. Can you keep 1 out of 100 draws?
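In PyStan 2 that is just the `thin` argument to `sampling`; a sketch, reusing the compiled model `sm` and data `stan_data` from the earlier sketch:

```python
# thin=100 keeps one draw out of every 100, shrinking the returned fit
# (and hence the payload pickled back to the parent process) ~100x.
fit = sm.sampling(data=stan_data, chains=4, n_jobs=1,
                  iter=20000, warmup=2000, thin=100)
```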

PyStan 3, unlike PyStan 2, does not use pickle to store the sampler output. If the pickle size limit is the source of the problem, PyStan 3 will not encounter it.

Also, PyStan 3 will work fine on Linux with version 3.7 of Python.

Thinning is a good idea. Didn’t think about that. Will def try. Thanks!

This failed pickle was handled in the multiprocessing library, and I think concurrent.futures' ProcessPoolExecutor still uses multiprocessing under the hood?

If not, then that is great!

Update - setting thin=5 did the job. Also, running PyStan commands from the command line makes everything faster. Any ideas why PyCharm slows things down? @ahartikainen, I don't know if you remember, but I had an issue with opening multiple *.pkl files in the same code, so running it from the command line works well.

I believe the failed pickling is due to the size of the return value.
PyStan 2 pickles all the draws and returns them to the parent process.
PyStan 3 doesn’t pickle the draws (ever).

Cannot wait to use PyStan 3!