For monitoring samples during sampling (as here), it would be useful for the sample_file to use a file format that has fast access and includes header information on how many samples are in the file so far. I’ve used HDF5 for other projects and I think it would be perfect here, especially since it has a “single-writer multiple-readers” (SWMR) mode that permits one process to do the writing while another monitor process keeps tabs on the updating contents. If this is explored, I also suggest using Blosc for fast compression, as I’ve found it works well with HDF5. Unfortunately, all my work with both HDF5 and Blosc has been through their Python interfaces and I don’t know C++ well at all, else I’d have tried my hand at adding them to Stan myself.
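To make the "header with a sample count" idea concrete, here is a minimal standard-library sketch. It is not HDF5 and not anything Stan does; the file layout and the names `create`, `append_draw`, and `read_draws` are made up for illustration. The writer appends a fixed-width record and only then bumps the count in the header, so a monitor can jump straight to the draws it has not seen yet.

```python
import os
import struct
import tempfile

# Hypothetical layout: header = (n_samples: uint64, n_params: uint32),
# followed by n_samples rows of n_params little-endian doubles.
HEADER = struct.Struct("<QI")


def create(path, n_params):
    """Start an empty sample file with a zero sample count."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(0, n_params))


def append_draw(path, draw):
    """Writer side: append one draw, then update the count in the header."""
    with open(path, "r+b") as f:
        n_samples, n_params = HEADER.unpack(f.read(HEADER.size))
        if len(draw) != n_params:
            raise ValueError("draw has the wrong number of parameters")
        row = struct.Struct(f"<{n_params}d")
        f.seek(HEADER.size + n_samples * row.size)
        f.write(row.pack(*draw))
        f.flush()
        # The count is written last, so it never points past written rows.
        f.seek(0)
        f.write(HEADER.pack(n_samples + 1, n_params))


def read_draws(path, start=0):
    """Monitor side: read rows start..n_samples-1 using the header count."""
    with open(path, "rb") as f:
        n_samples, n_params = HEADER.unpack(f.read(HEADER.size))
        row = struct.Struct(f"<{n_params}d")
        f.seek(HEADER.size + start * row.size)
        return [list(row.unpack(f.read(row.size)))
                for _ in range(start, n_samples)]


def demo():
    path = os.path.join(tempfile.mkdtemp(), "draws.bin")
    create(path, n_params=2)
    append_draw(path, [0.1, 0.2])
    append_draw(path, [0.3, 0.4])
    seen = read_draws(path)                   # monitor's first poll
    append_draw(path, [0.5, 0.6])
    new = read_draws(path, start=len(seen))   # only the draw added since
    return seen, new
```

HDF5 in SWMR mode would give the same pattern (a resizable dataset whose current shape plays the role of the count) plus compression and safe concurrent access, which this toy version does not attempt to provide.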
We talked about HDF5 (search the dev list, maybe the wiki) and there are issues with stability unless (and maybe even if) we do a solid C implementation.
Hm, I searched here, the old stan-dev Google group, and the GitHub wiki pages, and I don’t see anything about stability issues with using HDF5.
Oh, I think I found it. This?
I guess there’s more that’s not documented. I looked at the C++ libraries
for it and wasn’t impressed; mostly I saw things that are clues to a
floundering project (stability issues that are on a long-term fixme list
and not getting fixed, mailing list messages with people having stability
issues and responses like ‘hey, we’d like to fix that, soon’). I’m not
saying it’s not worth looking again, just remembering what put me off. I
didn’t have a lot of time to look into it, so it was more of a best guess at
the viability of the project rather than a serious evaluation.
We are looking at having a protobuf binary format that should be fast to
read incrementally.
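As a rough illustration of what "fast to read incrementally" can mean for a binary stream, here is a small sketch of length-prefixed records. This is an assumption for illustration only, not the planned protobuf format: it uses a fixed 4-byte length prefix from the standard library rather than protobuf's own encoding, and `write_record` and `read_new_records` are hypothetical names. The reader remembers a byte offset and only returns records that are completely written.

```python
import io
import struct

LEN = struct.Struct("<I")  # hypothetical 4-byte little-endian length prefix


def write_record(stream, payload):
    """Append one record: its length, then its bytes."""
    stream.write(LEN.pack(len(payload)) + payload)


def read_new_records(stream, offset):
    """Return (complete records found after offset, new offset)."""
    stream.seek(offset)
    records = []
    while True:
        head = stream.read(LEN.size)
        if len(head) < LEN.size:
            break                      # no further record has been started
        (n,) = LEN.unpack(head)
        body = stream.read(n)
        if len(body) < n:
            break                      # record still being written; retry later
        records.append(body)
        offset += LEN.size + n
    return records, offset


def demo():
    buf = io.BytesIO()                 # stands in for the growing output file
    write_record(buf, b"draw-1")
    write_record(buf, b"draw-2")
    buf.write(LEN.pack(6) + b"dra")    # writer caught mid-record
    first, offset = read_new_records(buf, 0)
    buf.seek(0, io.SEEK_END)
    buf.write(b"w-3")                  # writer finishes the record
    second, offset = read_new_records(buf, offset)
    return first, second
```

Each poll costs only the bytes added since the previous one, since nothing before the saved offset is read again.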
I think I see that protobuf is being used in httpstan now; is there any timeline for getting it into the core library and/or RStan?
You might want to start a separate topic.
Good call. Will do.