RAM usage increasing linearly with pystan sampling

I’m using PyStan 3 to sample a model with a large number of samples (num_samples=1e5 with num_chains=1, or num_samples=5e4 with num_chains=4). RAM usage during the sampling step (stan.model.Model.sample) increases linearly with time; currently I need 18 GB of RAM to sample my model, which seems excessive. This was not an issue in PyStan 2 on Windows, where RAM usage was constant during sampling.
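For context, a minimal sketch of the setup (the model here is a placeholder, not my actual model):

```python
import stan

# Placeholder program; the real model is considerably larger.
program_code = """
parameters { real theta; }
model { theta ~ normal(0, 1); }
"""

model = stan.build(program_code, random_seed=1)

# RAM usage grows roughly linearly over the course of this call.
fit = model.sample(num_chains=1, num_samples=100_000)
```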

Using the tracemalloc library, I have found that during the sampling process, line 139 of httpstan/services_stub.py, `messages_files[s].write(messages_compressobjs[s].compress(message))`, allocates an increasing amount of RAM. Has anyone come across this issue before?
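For reference, this is roughly how I traced it (a sketch; the snapshot handling is incidental):

```python
import tracemalloc

tracemalloc.start(25)  # record up to 25 frames per allocation

fit = model.sample(num_chains=1, num_samples=100_000)

# Group live allocations by source line; in my runs
# httpstan/services_stub.py:139 dominates the list.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```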

Operating System: Ubuntu 20.04
Interface Version: PyStan v3.3.0 and httpstan v4.6.1
Compiler/Toolkit: gcc v11.1.0

This sounds about right. PyStan 3 (via httpstan) collects draws in memory and only writes them to disk once sampling is finished. PyStan 2 may have pre-allocated the required space; peak memory use should be about the same either way.

You shouldn’t need 1e5 (100,000) samples for Bayesian inference. I’d suggest running fewer iterations or, if you have really long autocorrelation times, thinning the output (see the sketch below).
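Something along these lines, assuming your PyStan/httpstan version forwards num_thin to the underlying Stan service (a sketch, not tested against your model):

```python
# Keep every 10th post-warmup draw: the chains still run the same
# number of iterations, but only a tenth of the draws are retained.
fit = model.sample(num_chains=4, num_samples=100_000, num_thin=10)
```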

The principle is that expectation estimates have MCMC standard error equal to posterior sd / sqrt(ESS), where ESS is the effective sample size. At ESS = 100, the standard error is 1/10 of the posterior sd; it takes ESS = 10,000 to shrink it to 1/100 of the sd. No matter how many MCMC draws there are, the posterior sd still sets the scale of our uncertainty through this relation. Also, the standard error is just an estimate of the error in estimating the posterior mean (or variance or other expectation), not of the uncertainty in estimating the parameter itself.
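As a quick worked check of that scaling (illustrative numbers only):

```python
import numpy as np

sd = 1.0  # posterior sd of the quantity being estimated
for ess in (100, 10_000):
    mcse = sd / np.sqrt(ess)  # MCMC standard error of the mean estimate
    print(f"ESS = {ess:>6}: MCSE = {mcse:.3f} * sd")
```

Halving the standard error always costs four times the effective sample size.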

For this model, in addition to expectations, we are also interested in estimating centred credible intervals for various model parameters, i.e. in the 2.5th and 97.5th percentiles of each parameter’s posterior distribution. My understanding is that MCMC error grows towards the tails of the posterior, which is mitigated somewhat by drawing a larger number of posterior samples. Empirically, we have observed that 100,000 iterations are just about sufficient to give stable estimates (to two significant figures) of quantiles this extreme over repeated runs of the model fit.
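One way we sanity-check this is with quantile-specific diagnostics, e.g. via ArviZ (a sketch; assumes the PyStan 3 fit converts cleanly with from_pystan):

```python
import arviz as az

idata = az.from_pystan(posterior=fit)  # fit returned by PyStan 3

# ESS and Monte Carlo standard error evaluated at the 2.5th and 97.5th
# percentiles; tail quantiles typically need far more draws than the
# mean to reach the same precision.
for prob in (0.025, 0.975):
    print(az.ess(idata, method="quantile", prob=prob))
    print(az.mcse(idata, method="quantile", prob=prob))
```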