Usage of Arrow with Stan

I recently did some work to allow CmdStan to write its output in the Arrow IPC stream format (the files can be read before they’re fully written, including from R, so that’s nice), primarily b/c my output files get big and, well, we’ve talked about .csv problems a lot on here.

My changes to CmdStan were relatively small and only cover the HMC algorithm, so my goal was not to merge into the dev branches. Main question: does anyone have work on using Arrow-formatted files for Stan output that you would be willing to share here?

If there’s any interest the two branches I have are here:

If you diff against the relevant develop branch, you can see the cmdstan changes in particular are very hacky, but I think the stan changes are a reasonable use of the Arrow C++ API. I’m currently living in the stone ages of Stan, so I’m just calling CmdStan directly from R with system2.

Pinging @mitzimorris

We hope to use Arrow in the future, a discussion of which can be found in our design doc repo. I think it is super cool that you have a working prototype for this!

thanks @sakrejda! very useful - especially arrow_ipc_writer.hpp

Just now seeing this, super cool! What is the overall structure of how you’re writing things out? Just a single table as with the CSVs? Warmup and samples in the same table? Are the step-size and metric written somewhere separately? Any metadata?

Plenty of metadata. I’ll post a snippet that munges some of the output. The output goes into a few files: messages with time stamps in one, samples in another, and a separate file that tells you the indexing into columns and named parameters.

Glad it helps!

Nice, the output design is one of those long-suffering 😅 parts of the project so I’m glad it’s continuing to get more attention! I’ll check it out.

P.S. If you do end up doing more design/implementation, ping me and I’d be happy to help with that; the design doc looks pretty close to the choices I made.

Hey, to give you an idea of the format, you can see the table metadata in the arrow_ipc_writer.hpp file Mitzi mentions above. The actual schemas for the three tables are here: stan/arrow_ipc_writer.hpp at 9436b089abfa1166946f8e090293b290f96247d5 · sakrejda/stan · GitHub