I am experimenting with a large data set with around 770 million observations and several thousand parameters. Using CmdStan, it looks like I have to write the data to disk and read Stan’s output as ASCII-formatted numbers (CSV or JSON). While not terrible, I am looking to save some memory by getting data in and out of the model in a more efficient binary format (e.g. an npy file) or, better yet, by calling into the Stan library without serializing to disk.
What are the most memory-efficient Stan interfaces that run on Linux? It looks like RStan and PyStan can both interface directly with the Stan library (although it appears PyStan now has a JSON layer of indirection through httpstan). Are there other recommendations? Any insights regarding an interface that is clearly more ergonomic or more widely adopted and supported?
If by memory you mean RAM, CmdStan’s writing to disk is probably the best you can do, since the in-memory interfaces like PyStan/RStan allocate the entire draws storage in memory at once. If disk space is a concern but not a hard limit, post-processing the CSV file into something more fitting for your use case is a valid option.
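To illustrate the post-processing idea, here is a minimal sketch in Python of streaming a CmdStan-style CSV (which carries `#` comment lines around the draws) into a flat binary file, one chunk at a time, so peak RAM stays at one chunk rather than the whole draws matrix. The in-memory CSV text and the file names are stand-ins for illustration, not part of any real run:

```python
import io
import numpy as np
import pandas as pd

# A tiny stand-in for a CmdStan output file; real runs would open
# "output.csv" (or similar) instead of this in-memory string.
csv_text = """# model = example
lp__,theta.1,theta.2
-7.3,0.1,0.2
-7.1,0.3,0.4
# elapsed time: ...
"""

# Stream the CSV in chunks, appending each chunk as raw float32 bytes.
chunks = pd.read_csv(io.StringIO(csv_text), comment="#", chunksize=1)
n_rows, n_cols = 0, 0
with open("draws.f32", "wb") as out:
    for chunk in chunks:
        arr = chunk.to_numpy(dtype=np.float32)
        n_rows += arr.shape[0]
        n_cols = arr.shape[1]
        out.write(arr.tobytes())

# The flat file can later be reopened lazily without loading it all:
draws = np.memmap("draws.f32", dtype=np.float32, shape=(n_rows, n_cols))
print(draws.shape)  # (2, 3)
```

The same loop scales to arbitrarily many draws, since only one chunk is ever resident in memory.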
Any suggestions for getting the data into stan without serializing to disk first?
PyStan serializes the data to JSON but keeps that JSON object in memory. RStan reuses the memory of the R process for the data.
I don’t know if there is a way to have the data passed in-memory and the samples written out-of-memory for any of our interfaces. I believe RStan has an option to write the samples to disk, but I’m not sure if this also allocates space for them in memory or not.
There is always the possibility of using pure httpstan if really needed (and I’m not suggesting it would be a good idea :) ), but I would assume one could stream data from httpstan to some binary writer if needed.
But CmdStan with Parquet support is probably the best option.
This sounds like exactly what I need, but I thought CmdStan only supported CSV, JSON, and Rdump. Is there a version of CmdStan with Parquet support?
Not currently. This is something that has been proposed and accepted but has not yet been implemented.
Can you tell us more about your data and model? Maybe you could use other software that supports distributed computing? With that ratio of observations to parameters, the posterior would be very narrow for many models, and maybe MCMC is not needed? Or maybe you can aggregate and use a summary-statistics approach?
I’ve got an experimental Python interface/implementation of Stan, using BridgeStan, which for every iteration returns a numpy array containing a draw from the target distribution. With a numpy array you can do whatever you like: store it in an npy file, update online calculations of means, variances, or quantiles, or use anything else Python has to offer for numpy arrays.
It won’t be as fast as CmdStan, since iteration happens in Python instead of C++. But all leapfrog steps/trajectories happen in C++, so for a model of the size you mention, where gradient calculations are likely to dominate computation time, it shouldn’t be too far off (I think/hope).
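To make the “online calculations” idea concrete, here is a sketch of a Welford-style streaming mean and variance computed one draw at a time, which needs only O(dim) memory no matter how many iterations the sampler runs. Since the actual interface isn’t shown here, a numpy random generator stands in for the per-iteration draw:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_draws = 4, 1000

# Welford's online update: a single pass over the draws, keeping only
# the running mean and the sum of squared deviations (m2).
mean = np.zeros(dim)
m2 = np.zeros(dim)
for n in range(1, n_draws + 1):
    draw = rng.normal(size=dim)  # stand-in for one sampler iteration
    delta = draw - mean
    mean += delta / n
    m2 += delta * (draw - mean)

variance = m2 / (n_draws - 1)  # unbiased sample variance per parameter
```

For standard-normal stand-in draws, `mean` ends up near 0 and `variance` near 1; in the real setting each `draw` would come from the sampler instead.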
I’m still evaluating the implementation for accuracy and speed, but if you’re willing to experiment with me, send me a direct message.
@roualdes, it was great meeting you at StanCon, and I’m happy to see you are still hacking on BridgeStan! In the long term I am really excited about BridgeStan as a language-agnostic C-style API/ABI that will, hopefully, offer all of the same features as CmdStan. I recall that back in June 2023 there was still a lot of fiddling required to do MCMC sampling using BridgeStan. It sounds like you’ve figured a lot of that stuff out since then. Is there an example of sampling from the target distribution on the BridgeStan GitHub I should check out?
For now, I am working with RStan as the only “ready to use” Stan interface with zero-copy in-memory data, albeit with the potential for serious memory limitations when drawing a large number of samples (since all samples are kept in memory).
I’ve been separately hacking away on a project I’ve tentatively named ffistan, which uses BridgeStan-like C APIs for the Stan services layer (e.g. Pathfinder, sampling, optimization). It’s essentially a language-agnostic way to create in-memory interfaces à la PyStan/RStan. In principle there’s no reason you couldn’t add this functionality to BridgeStan itself, but for separation of concerns it is nice to keep them apart.
I’ve recently been testing it with memory-mapped output arrays, which was pretty successful, so in practice it is also reasonable to avoid the “all samples in memory” problem, at least in Python and Julia.
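For readers curious what memory-mapped output looks like on the Python side, here is a minimal sketch using numpy’s memmap-backed npy files. The file name and the simulated iteration loop are purely illustrative, not ffistan’s actual API:

```python
import numpy as np

n_draws, n_params = 10_000, 8

# Preallocate the draws matrix on disk as a valid .npy file; each row is
# written as it arrives, so the OS pages memory in and out and RAM never
# needs to hold the full matrix at once.
draws = np.lib.format.open_memmap(
    "draws.npy", mode="w+", dtype=np.float64, shape=(n_draws, n_params)
)
rng = np.random.default_rng(1)
for i in range(n_draws):
    draws[i] = rng.normal(size=n_params)  # stand-in for one iteration
draws.flush()

# Later (even in another process), load lazily without reading it all:
reloaded = np.load("draws.npy", mmap_mode="r")
print(reloaded.shape)  # (10000, 8)
```

Because the on-disk file is a standard npy file, any numpy-aware downstream tooling can read it directly.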
I wouldn’t say it’s quite prime-time ready, but if you wanted to give it a try I’d love to hear how it works out for you.
@WardBrian, ffistan looks cool, I’ll definitely check that out! (Although I can’t promise that I have the developer time to give you serious feedback at this stage.) Out of curiosity, is your intent with ffistan to write a C-compatible API/ABI for FFI with languages like Python, Rust, etc.? Or are you planning to write a more stable/better documented C++ API around the CmdStan internals that someone else can lower to a C ABI?
I have written a C-level API around Stan’s internals (whether it is better or better documented is up to the reader), along with some “clients” in different languages (currently Python, Julia, and R).
I am separately interested in writing alternative C++ APIs in Stan itself.
I’ve got similar caveats to Brian’s: I’m still hacking, it’s not in BridgeStan for separation of concerns, and it’s not prime-time ready. Nonetheless, here’s a test file from my repository experimentalHMC. There’s much more boilerplate there for testing and whatnot, but you should be able to find the construction of a Stan model and the sampling iterations on the model. Let me know if you want me to help you fit a model with it. Cheers!