Stan sampler gets stuck

Hello all,

I am fitting a model with 4 parallel chains in PyStan 2.19.0.0 using the PyCharm console. After two chains finish sampling, the entire process gets stuck and the two remaining chains never start. I am on a Linux CentOS 7 cluster and I allocate 50 GB to the process; each of the first two chains takes 20 GB.
Any advice would be appreciated.

It seems to me that the third chain has a large R-hat and does not converge.
In my experience, MCMC sampling takes a long time when a chain does not converge. If you can find a seed for which sampling with a single chain does not converge, then the problem is caused by the model, the data, etc.
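A minimal sketch of such a single-chain check in PyStan 2 (the toy model, data, and seed here are made up; substitute your own):

```python
import pystan

# Hypothetical toy model and data; substitute your own Stan program.
model_code = """
data { int<lower=0> N; vector[N] y; }
parameters { real mu; real<lower=0> sigma; }
model { y ~ normal(mu, sigma); }
"""
sm = pystan.StanModel(model_code=model_code)
stan_data = {"N": 5, "y": [0.1, -0.3, 0.8, 1.2, -0.5]}

# One chain with a fixed seed makes the run reproducible, so a
# non-converging seed can be isolated and re-run for debugging.
fit = sm.sampling(data=stan_data, chains=1, seed=12345, iter=2000)

# Check Rhat / n_eff in the summary for signs of non-convergence.
print(fit.stansummary())
```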

Thanks, although I am not sure this is the issue in my case. When I allocate enough memory (100 GB), all chains run at the same time. I am surprised that multiprocessing cannot move on from the first two chains to the last two when less memory is available. So maybe there is still a way around it.

Can you try to run the script from the command line?

I am still debugging, so I must run from the console. Overnight all four chains finished sampling, but the run failed with the error:

multiprocessing.pool.MaybeEncodingError: Error sending result: '[(0, <stanfit4anon_model_efc859a5164be00d3af9579329c2196d_4498398224597621611.PyStanHolder object at 0x2b03802f6eb0>)]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647")'

So frustrating.

Oh, that is true. Multiprocessing doesn’t work in that case. That is a pickling error.

You need to run the chains serially (n_jobs=1).

Alternatively, run your model with one chain from a script and save the output with ArviZ, then run that script n times. Once all runs have finished, combine the chains (arviz.concat), as in the sketch below.
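A minimal sketch of that workflow (the toy model, file names, and seed scheme are made up):

```python
# run_chain.py -- run as `python run_chain.py <chain_id>`, once per chain.
import sys
import pystan
import arviz as az

model_code = "parameters { real mu; } model { mu ~ normal(0, 1); }"
sm = pystan.StanModel(model_code=model_code)

chain_id = int(sys.argv[1])
fit = sm.sampling(chains=1, n_jobs=1, seed=1000 + chain_id)
az.from_pystan(posterior=fit).to_netcdf(f"chain_{chain_id}.nc")
```

and once all runs have finished:

```python
import arviz as az

# Combine the per-chain files along the chain dimension.
idatas = [az.from_netcdf(f"chain_{i}.nc") for i in range(4)]
combined = az.concat(idatas, dim="chain")
print(az.summary(combined))
```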

Yes, thank you! I just started a run with n_jobs=1. Is this because the model is too big?
Also, I read that this issue was solved in PyStan 3 - will it be released anytime soon?

Yes. It might also be fixed in Python 3.8, where pickle/multiprocessing uses the correct flags.

PyStan 3 was supposed to fix this, but it actually changed back to a process-based approach, and the minimum Python version is now 3.8, so… we need to test this.

I see, thanks. I hope n_jobs=1 will work. I am still not sure what the solution to my original issue is, though… (or is it a branch of the same problem?)

PyStan 2 uses pickle to move the final output draws around. There are some size limits on how much you can pickle at one time. I suspect you are exceeding these limits.
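For what it's worth, the exact error in that traceback can be reproduced without Stan: on Python < 3.8, the multiprocessing pipe packs the payload's byte length as a signed 32-bit int, so any pickled result over 2**31 - 1 bytes overflows the header. A minimal sketch:

```python
import struct

# multiprocessing (Python < 3.8) packs the result's length with "!i",
# a signed 32-bit int; lengths over 2**31 - 1 overflow it:
struct.pack("!i", 2**31)
# struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```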

I think the easiest way to solve this problem might be to thin your results so your final chain is smaller. Can you keep 1 out of 100 draws?
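In PyStan 2 that is just the `thin` argument to `sampling`; a sketch, reusing the compiled model `sm` and data `stan_data` from the earlier sketch:

```python
# thin=100 keeps one draw out of every 100, shrinking the returned fit
# (and hence the payload pickled back to the parent process) ~100x.
fit = sm.sampling(data=stan_data, chains=4, n_jobs=1,
                  iter=20000, warmup=2000, thin=100)
```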

PyStan 3, unlike PyStan 2, does not use pickle to store the sampler output. If the pickle size limit is the source of the problem, PyStan 3 will not encounter it.

Also, PyStan 3 will work fine on Linux with version 3.7 of Python.

Thinning is a good idea. Didn’t think about that. Will def try. Thanks!

This failed pickle was handled in the multiprocessing library, and I think concurrent.futures' ProcessPoolExecutor still uses multiprocessing under the hood?

If not, then that is great!

Update - setting thin=5 did the job. Also, running PyStan commands from the command line makes everything faster. Any ideas why PyCharm slows things down? @ahartikainen, I don't know if you remember, but I had an issue with opening multiple *.pkl files in the same code, so running it from the command line works well.

I believe the failed pickling is due to the size of the return value.
PyStan 2 pickles all the draws and returns them to the parent process.
PyStan 3 doesn’t pickle the draws (ever).

Cannot wait to use PyStan 3!