Issues with PyStan, multiprocessing and subsequently the PyStan3 pre-release on Mac OS Catalina

I have been encountering a lot of cases where my PyStan code crashes on Mac OS Catalina, spawning a ‘ReportCrash’ and printing out the following repeatedly:

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn’t close file
libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn’t close file
libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn’t close file

I traced this through a similar issue with another Python package and concluded that it is caused by some combination of the latest Mac OS and Python’s multiprocessing package. Since PyStan3 doesn’t use that package, it should solve my problem. For my work, I use the same models with different data configurations, and I cannot figure out how to instantiate a model without supplying the data immediately, as I can in the normal release of PyStan. Has this functionality been removed?

You don’t need to do that. Model compilation is done once, and the other rounds read the model (without data) from cache.

Ah I see, so calling build on each iteration should not affect my performance, thank you for the quick response.
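
For anyone following along, the loop being discussed might look like the sketch below. It assumes the PyStan 3 interface (stan.build(program_code, data=...) followed by posterior.sample(...)). The helper name fit_each_dataset and its build parameter are made up for illustration, and the caching is as described in the reply above, not something this code enforces.

```python
def fit_each_dataset(program_code, datasets, build=None, **sample_kwargs):
    """Build and sample the same Stan program once per dataset.

    `build` defaults to PyStan 3's `stan.build`. It is a parameter only so
    the loop can be exercised without PyStan installed (an assumption of
    this sketch, not part of PyStan).
    """
    if build is None:
        import stan  # PyStan 3; imported lazily so the helper stays importable
        build = stan.build

    fits = []
    for data in datasets:
        # Per the reply above, the compiled model is read from cache after
        # the first build, so repeated builds should be cheap.
        posterior = build(program_code, data=data)
        fits.append(posterior.sample(**sample_kwargs))
    return fits
```

Usage would be something like fit_each_dataset(code, [data_a, data_b], num_chains=4, num_samples=1000), where code, data_a and data_b are your own program string and data dictionaries.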

Do you happen to know how I can emulate return fit.extract(permuted=True) behaviour?

You don’t want to permute your samples. Just use all of your samples when doing further work.

ArviZ also supports PyStan3; read the docstring for how to pass the model to from_pystan. InferenceData can be saved to netCDF format.

https://arviz-devs.github.io/arviz/generated/arviz.from_pystan.html#arviz.from_pystan

Edit: You can also do fit['y'], so there is no need to call extract.

The reason I ask is to avoid having to change my entire pipeline, but before I go further I think I should give more context on what I am trying to do. I’d like to run the equivalent of the following, though n_jobs does not exist in PyStan3, so I am not sure how to set up sample to run across multiple cores:

fit = model.sample(num_warmup=warmup, num_samples=num_iter, num_chains=chains, n_jobs=cpu_count, verbose=True)

Do you know of a better solution to the errors caused by Python’s multiprocessing that might require less reworking of my code? How can I still utilise multiprocessing-style computation with PyStan3 or some other approach? Is there any current (consistently) working means to use Stan across multiple cores on Mac OS?

PyStan3 uses threading to parallelize over multiple cores, so it should already be using multiple cores (I think).

If you want just dictionary try

fit.to_frame().to_dict()

or just use the DataFrame.

Edit: I think you can control the number of threads with the following (I need to verify this):

import os
os.environ["STAN_NUM_THREADS"] = "4"

Thank you, I have managed to get the fit integrated correctly. I am having some pretty serious performance issues though. I don’t seem to get any console output using build; is that intentional? It makes it hard to verify whether multiple cores are being utilised: in Activity Monitor this doesn’t seem to be the case, but it is hard to be sure. I appreciate that this version of Stan is not meant for general use, and again I am open to trying other avenues if you have any suggestions for running experiments reliably on Mac OS with normal PyStan.

Currently I pass x models into a function one after another with the same data, then generate new data and pass in the same x models again, will caching work correctly here?

I finally traced the issue back to the matplotlib library and think I have fixed it for good by editing some code on their end. I will raise an issue on their project to try to get this fixed, as I presume it affects any package that leverages multiprocessing on Mac OS Catalina.

@hwilde (or others), is it possible to get some more information about this fix? It’s making things very difficult for some runs on my machine.

Can you try to run pystan in virtual environment without matplotlib installed?

Didn’t even need to do that – just not importing/calling matplotlib seems to be enough to “fix” the problem. But this is not a good fix, since I really need to plot my results…

Can you import matplotlib after your sampling is done?

This of course does not help for notebook work, but if you use scripts, then it should.
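
In a script, that ordering could be sketched roughly as below. This is only an illustration: run_sampling and plot_results are placeholder callables you would supply yourself, and the guard just inspects sys.modules to catch the case where matplotlib has already been pulled in (possibly indirectly by another package) before sampling starts.

```python
import sys


def matplotlib_is_loaded():
    """Return True if any matplotlib module has already been imported."""
    return any(name == "matplotlib" or name.startswith("matplotlib.")
               for name in sys.modules)


def sample_then_plot(run_sampling, plot_results):
    """Run sampling first and only then let plotting code import matplotlib.

    `run_sampling` and `plot_results` are hypothetical callables supplied by
    the caller; `plot_results` should do its own `import matplotlib` inside
    its function body rather than at the top of the script.
    """
    if matplotlib_is_loaded():
        raise RuntimeError("matplotlib was imported before sampling started")
    fit = run_sampling()
    return plot_results(fit)
```

The check fails loudly instead of letting the crash show up later during sampling, which makes it easier to spot an import that slipped in through another module.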

See the issue here https://github.com/matplotlib/matplotlib/issues/15410

Ok, thanks. But is there in fact a matplotlib fix that can be applied? The problems affect plotting there in addition to Stan issues.

I noted down this change to myself when I managed to fix it:

You need to edit matplotlib’s font_manager.py to get multiprocessing to work correctly on Mac OS Catalina. Change:

if hasattr(os, "register_at_fork"):
    os.register_at_fork(after_in_child=_get_font.cache_clear)

To:

if hasattr(os, "register_at_fork"):
    os.register_at_fork(before=_get_font.cache_clear)
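
To see what the two hooks do differently, here is a small standalone sketch using only the standard library. It does not touch matplotlib: load_font is a made-up stand-in for a cached loader such as _get_font. A before hook runs in the parent just ahead of fork(), so the cache is already empty when the child is created, whereas an after_in_child hook runs in the child once the fork has already happened.

```python
import functools
import os


@functools.lru_cache(maxsize=None)
def load_font(path):
    """Stand-in for an expensive, cached loader (hypothetical)."""
    return object()


if hasattr(os, "register_at_fork"):  # POSIX only, Python 3.7+
    # Clear the cache in the parent right before every fork, so no cached
    # objects are carried across into the child process.
    os.register_at_fork(before=load_font.cache_clear)


def cache_size_after_fork():
    """Fill the cache, fork once, and report the parent's cache size."""
    load_font("a.ttf")
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup
    os.waitpid(pid, 0)
    return load_font.cache_info().currsize
```

One consequence of the before variant is that the parent also loses its cached entries on every fork and has to rebuild them afterwards.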

That seems to work; thanks. Has this been communicated to the matplotlib team?