After using brmspy in various environments, I have found some serious remaining pain points that make it both brittle and unpredictable.
The main ones are unexpected crashes that come either from segfaults in R itself or from the rpy2 bridge the library uses to make R calls. Because the R session is embedded within the Python process, the Python environment crashes in an unrecoverable way too, and in some cases it even takes the IDE down with it. My choice to force ABI mode made things better, but it's far from perfect; at least for me it is not production grade and requires careful retry strategies in my AWS pipelines.
The lack of isolation for R is also the main culprit behind brittle tests and the serious OS-specific workarounds I have had to build into brmspy. It also makes the R session effectively unchangeable once you have imported some packages.
Since there's no way I will be able to solve every unpredictable behaviour of R, rpy2 or the various OS quirks, I took a step back to think about how the R session could be isolated without significant overhead.
Right now the architecture can be thought of as a single process of Python + R (embedded). A crash in one immediately takes down the other. The logical move is to split this into three nodes: Main (py) + Worker (py) + R (embedded). The obvious problem is the overhead of memory use and data copying, so the minimum requirement for such a setup is that it uses no more memory than the two-node architecture.
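As a rough, hypothetical sketch of that topology (plain pipe messages stand in for the shared-memory codecs described below, and none of these names are the actual brmspy internals): only the worker imports rpy2, so a segfault in R can only take the worker down, and the main process can restart it.

```python
# Hypothetical sketch of the Main (py) + Worker (py) + R (embedded) split.
# Only the worker imports rpy2 / initialises R; main never touches it.
import multiprocessing as mp

def _worker_loop(conn):
    import rpy2.robjects as ro  # embedded R lives only in this process
    while True:
        kind, payload = conn.recv()
        if kind == "shutdown":
            break
        try:
            conn.send(("ok", str(ro.r(payload))))   # toy: eval R code, return its printout
        except Exception as exc:
            conn.send(("error", repr(exc)))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")                   # fresh interpreter, no inherited R state
    parent_conn, child_conn = ctx.Pipe()
    worker = ctx.Process(target=_worker_loop, args=(child_conn,), daemon=True)
    worker.start()

    parent_conn.send(("eval", "R.version.string"))
    print(parent_conn.recv())                       # if the worker segfaults, main survives
    parent_conn.send(("shutdown", None))
    worker.join()
```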
I have an experiment branch where I have managed to get the idea fully working.
The 3-node idea has these core features:
- Uses shared memory to minimise data copying and avoid memory use spikes. This means we operate on the same memory use assumptions as with direct rpy2 use
- A module session that proxies regular-looking brms module calls to a worker (the IDE doesn't know the difference). This allows the refactor without any significant API rework
- Encoders/decoders for rpy2 that never copy data into the worker's Python heap; they copy the buffers of Matrices, DataFrames and other R data types directly into shared memory. No transformations are done on matrices, for example: the raw buffer can be reconstructed as a shared-memory-backed numpy array as-is (see the sketch after this list)
- Uses a context manager to make any changes to the R session. The R session can have its packages removed, reinstalled, R versions hotswapped etc with no side effects on any OS or a need to restart Python.
- Environments and runtimes are isolated from one another. Runtimes will never be mutated by brmspy, they can be reused in environments.
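To make the zero-copy point concrete, here is a minimal sketch of the idea using Python's multiprocessing.shared_memory. The put_array/get_array helpers are illustrations made up for this post, not the actual brmspy codecs, and a real implementation also has to track and eventually unlink the segments:

```python
import numpy as np
from multiprocessing import shared_memory

def put_array(arr: np.ndarray) -> dict:
    """Copy the array's buffer into shared memory once and return only metadata."""
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    order = "F" if arr.flags["F_CONTIGUOUS"] and not arr.flags["C_CONTIGUOUS"] else "C"
    np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf, order=order)[...] = arr
    # A real implementation keeps the handle around and unlinks the segment when done.
    return {"name": shm.name, "shape": arr.shape, "dtype": str(arr.dtype), "order": order}

def get_array(meta: dict) -> np.ndarray:
    """Reconstruct the array as a view over the same shared buffer; no further copy."""
    shm = shared_memory.SharedMemory(name=meta["name"])
    return np.ndarray(meta["shape"], dtype=np.dtype(meta["dtype"]),
                      buffer=shm.buf, order=meta["order"])
```

The only thing that crosses the process boundary is the small metadata dict; a multi-gigabyte posterior never gets serialised or duplicated.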
A major side effect is that since the main process's idiosyncrasies no longer indirectly affect the Python process running the embedded R, the previously easy-to-trigger segfaults have gone down significantly. The loo functions, for example, cause zero issues with this setup.
Examples of what the WIP process for managing an R environment looks like:
```python
# Stops the current session and starts it fresh. Automatically detects R_HOME,
# LD_LIBRARY_PATH etc. and sets an isolated location for installing user-managed packages.
with brms.manage(environment_name="mrp") as ctx:
    # Downloads cmdstan/brms/rstan into ~/.brmspy/runtime/{fingerprint}/
    # and adds the runtime to libPaths
    ctx.install_brms(use_prebuilt=True)
    # Installs MCMCglmm into ~/.brmspy/environment/{environment_name}/Rlib
    ctx.install_rpackage("MCMCglmm")

# Let's assume for some reason we want to switch to R 4.4.
# Stops the current session and starts it fresh; automatically detects LD_LIBRARY_PATH etc.
with brms.manage(environment_name="legacy", r_home="path/to/r/4.4") as ctx:
    ctx.install_brms(use_prebuilt=True)
```
After these two steps, on macOS, the folder structure for environments and runtimes would look like this:
```
.brmspy
├── environment
│   ├── default
│   │   ├── config.json
│   │   └── Rlib
│   ├── mrp
│   │   ├── config.json
│   │   └── Rlib
│   │       └── MCMCglmm
│   └── legacy
│       ├── config.json
│       └── Rlib
├── environment_state.json
├── runtime
│   ├── macos-arm64-r4.5-0.2.0
│   │   ├── cmdstan
│   │   ├── hash
│   │   ├── manifest.json
│   │   └── Rlib
│   └── macos-arm64-r4.4-0.2.0
│       ├── cmdstan
│       ├── hash
│       ├── manifest.json
│       └── Rlib
└── runtime_state.json
```
This whole process “just works” thanks to isolating the R session.
To illustrate the roundtrip a bit better, this is what it looks like for the brm function:
```
main:
  -> brm(fit)
  -> RModuleSession.__getattribute__ - checks if the module has the function, then caches its call if it does
  -> RModuleSession._call_remote
  -> codec.encode - for args and kwargs. Puts numpy, arviz, pandas and xarray objects into shared memory buffers that need no transformation to reconstruct. If already in shared memory, just reuses the shared memory information
  -> R objects are stored on the worker side and exchanged as simple SexpWrapper(rid, repr) objects between worker and main (formula, priors in this case)
  -> ships only the metadata and memory addresses to the worker

worker:
  -> codec.decode - for args and kwargs. Reconstructs objects from the buffers without transformations. Zero copy.
  -> gets Sexp (R objects) from the Sexp cache by rid (formula, priors)
  -> brms.brm(fit) - calls the embedded R's brms::brm()
  -> for any R Matrix or DataFrame returned, r_to_py(obj) copies the buffers DIRECTLY into shared memory for reconstruction as Python objects. Respects column-major, row-major etc.
  -> creates arviz objects using the existing buffers
  <- back in the worker's main loop, turns any Sexp objects into SexpWrapper
  <- codec.encode - stores the metadata needed for reconstruction plus the shm information
  <- sends a minimal data representation back to main (e.g. three 512 MB matrices become a couple hundred bytes)

main:
  <- codec.decode - reconstructs arviz objects from the received buffers. No copying happens again; the underlying data points to the same memory addresses as in the worker
  <- the result of codec.decode is returned to the user
```
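To make the first hop above a bit more concrete, here is a simplified sketch of the proxy step. The names RModuleSession and _call_remote come from the trace, but the bodies, the transport object and its request() method are placeholders rather than the real implementation (the real code hooks __getattribute__ as shown above; __getattr__ is just the simpler equivalent for a sketch):

```python
class RModuleSession:
    """Proxies attribute access so brms.brm(...) looks like a plain local call."""

    def __init__(self, module_name, transport):
        self._module_name = module_name
        self._transport = transport          # placeholder for whatever ships encoded calls
        self._cache = {}

    def __getattr__(self, name):
        # Unknown attributes become remote calls; cache the wrapper after the first lookup.
        if name not in self._cache:
            def remote(*args, **kwargs):
                return self._call_remote(name, *args, **kwargs)
            remote.__name__ = name
            self._cache[name] = remote
        return self._cache[name]

    def _call_remote(self, name, *args, **kwargs):
        # Real code runs args/kwargs through codec.encode and decodes the reply;
        # here the payload is passed through untouched.
        payload = {"module": self._module_name, "func": name,
                   "args": args, "kwargs": kwargs}
        return self._transport.request(payload)
```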
There is some metadata construction and JSON parsing overhead, but it's minimal and close to unnoticeable compared to the time it takes to run the large majority of brms functions. Since the codecs and rpy2 converters are per R or Python type, we don't need extensive logic for every function: with a few codecs and converters, the majority of data exchanged with R is already covered automatically. Any special or uncovered types fall back to pickle, but nothing larger than 1 MB should ever take that path in normal use.
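The dispatch itself can be tiny. As a rough illustration (building on the hypothetical put_array/get_array helpers sketched earlier; the registry layout is my assumption, not brmspy's actual code), pandas, arviz and xarray objects would get analogous encoders:

```python
import pickle
import numpy as np

ENCODERS = {}  # Python type -> encoder function

def register(py_type):
    def wrap(fn):
        ENCODERS[py_type] = fn
        return fn
    return wrap

@register(np.ndarray)
def encode_ndarray(arr):
    # Large numeric data takes the shared-memory path; only metadata is returned.
    return {"kind": "ndarray", "meta": put_array(arr)}

def encode(obj):
    for py_type, fn in ENCODERS.items():
        if isinstance(obj, py_type):
            return fn(obj)
    # Anything without a codec falls back to pickle; such payloads stay tiny in practice.
    return {"kind": "pickle", "blob": pickle.dumps(obj)}

def decode(payload):
    if payload["kind"] == "ndarray":
        return get_array(payload["meta"])    # zero-copy view over the shared buffer
    return pickle.loads(payload["blob"])
```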
Keeping memory use low is critical for my use cases, as the posteriors I work with are sometimes around 5 GB. Making extra copies of that would be very expensive in any automated pipeline.
Hence I might even need to introduce parameters to copy the R data in a chunked manner: once a chunk of a matrix or dataframe is copied, it is freed from memory on the R side, and the Sexp output from brm() or any other wrapper ends up as NULL. R's GC behaviour would obviously need investigating for this, so it's not a priority at the moment.
Also, the library stays usable without the extra worker process when the BRMSPY_WORKER="1" env var is set. That mode is more unstable and I see little point in using it directly, but it's there.
I don't think I'll be shipping this for a couple more days; there are a ton of edge cases I want to try with the experimental architecture. But so far it's shockingly stable, and I haven't even seen worker crashes.
This might look like over-engineering 101 :), but I finally have some OSS time and I want to use brms from Python without worrying that some edge case will blow up my session.