I recently did some work to allow CmdStan to write output files in the Arrow IPC stream format. These files can be read before they’re fully written, including from R, so that’s nice. I did it primarily because my output files get big and, well, we’ve talked about .csv problems a lot on here.
My changes to CmdStan were relatively small and only cover the HMC algorithm, so my goal was not to merge into the dev branches. Main question: do you have work on using Arrow-formatted files for Stan output that you would be willing to share here?
If there’s any interest the two branches I have are here:
If you diff against the relevant develop branch, you can see that the cmdstan changes in particular are very hacky, but I think the stan changes are a reasonable use of the Arrow C++ API. I’m currently living in the stone ages of Stan, so I’m just calling CmdStan directly from R with system2.
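For anyone who wants to try the same "call the executable directly" pattern outside R, here is a rough Python sketch of the equivalent of a system2 call. It only uses standard CmdStan arguments (sample, num_samples, num_warmup, data file, output file, random seed). The flag that switches on Arrow output in the branches above is not shown in this thread, so it is deliberately left out. The function names and file paths are made up for illustration.

```python
import subprocess


def cmdstan_sample_args(data_file, output_file,
                        num_samples=1000, num_warmup=1000, seed=None):
    """Build the argument list for a compiled CmdStan model executable.

    Only standard CmdStan arguments are used; any Arrow-specific flag
    from the prototype branches is unknown here and therefore omitted.
    """
    args = [
        "sample",
        f"num_samples={num_samples}",
        f"num_warmup={num_warmup}",
        "data", f"file={data_file}",
        "output", f"file={output_file}",
    ]
    if seed is not None:
        args += ["random", f"seed={seed}"]
    return args


def run_cmdstan(executable, args):
    """Run the model executable, roughly what system2(executable, args) does in R."""
    return subprocess.run([str(executable), *args],
                          capture_output=True, text=True, check=True)


# Hypothetical paths, for illustration only.
args = cmdstan_sample_args("bernoulli.data.json", "output.csv",
                           num_samples=500, seed=42)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file paths. Calling `run_cmdstan("./bernoulli", args)` would then launch the sampler, assuming a compiled model exists at that path.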
Pinging @mitzimorris
We hope to use Arrow in the future, a discussion of which can be found in our design doc repo. I think it is super cool that you have a working prototype for this!
thanks @sakrejda! very useful - especially arrow_ipc_writer.hpp
Just now seeing this, super cool! What is the overall structure of how you’re writing things out? Just a single table as with the CSVs? Warmup and samples in the same table? Are the step-size and metric written somewhere separately? Any metadata?
Plenty of metadata. I’ll post a snippet that munges some of the output. The output goes into a few files: messages with time stamps in one, samples in another, and a separate file that tells you the indexing into columns and named parameters.
Nice, the output design is one of those long-suffering 😅 parts of the project, so I’m glad it’s continuing to get more attention! I’ll check it out.
P.S. If you do end up doing more design/implementation and ping me, I’d be happy to help with that. The design doc looks pretty close to the choices I made.
Hey, to give you an idea of the format, you can see the table metadata in the arrow_ipc_writer.hpp file Mitzi mentions above. The actual schemas for the three tables are here: stan/arrow_ipc_writer.hpp at 9436b089abfa1166946f8e090293b290f96247d5 · sakrejda/stan · GitHub