Alternative .csv reader

maintenance
rstan

#21

Lol, this is C++ that’s too messy at the moment to be a PR… this thread was me fishing to see if Ben thought the complexity of some (cleaned up) C++ might be worth the potential speedup in reading CmdStan .csv files in rstan. I think the answer was that they thought they could make the current R code faster, so I didn’t pursue it.

The code is under ‘github.com/sakrejda/stannis’ under ‘inst/include/zoom*’ and if you install that package you can run it with ‘stannis:::read_cmdstan_csv()’.

Next chance I get to update it I’m going to refactor the abundant typedefs into classes and add an intermediate binary serialization step so that you can access individual parameters without loading the entire sample into memory.


#22

The peak memory usage in R isn’t a very reliable measure for how much memory is required, since it depends on how frequently garbage collection is being done. With my CSV speedups, it did not change peak memory usage much in an unconstrained setting, but it was actually able to run with less available memory (tested by using ulimit to cap memory).

The new code is faster, but not as fast as your code. There is definitely room to speed up the CSV reading still, but also some of the performance difference is the extra work that read_stan_csv does in creating the stanfit object, so I don’t think it would be a 3x speedup.

I looked briefly at extending rstan to allow for this sort of backend. You could make the entire sample data frame a memory-mapped matrix that is stored on disk, and then you wouldn’t have to load anything into memory until it is needed by downstream analysis. The OS can handle loading the relevant pages from disk into memory, so you could process very large stanfit objects with minimal memory overhead.
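To make the on-demand loading concrete, here is a minimal Python sketch (standard library only) rather than anything from rstan. It assumes a headerless file of little-endian float64 draws; the file name `theta.bin` and both helper functions are hypothetical.

```python
import mmap
import os
import struct
import tempfile


def write_doubles(path, values):
    """Write values as raw little-endian float64 with no header."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}d", *values))


def read_draws(path, start, stop):
    """Return draws[start:stop] from a raw float64 file via mmap.

    Only the pages backing the requested slice have to be resident;
    the OS decides what to load and what to evict.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = mm[start * 8 : stop * 8]
    return list(struct.unpack(f"<{stop - start}d", chunk))


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "theta.bin")  # hypothetical file name
write_doubles(path, [0.5 * i for i in range(10_000)])

# Touch three draws in the middle of the file without reading the rest.
middle = read_draws(path, 5000, 5003)
```

The same pattern scales to files much larger than RAM, since the slice determines which pages get touched.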

Unfortunately, the mmap package in R didn’t seem to support creating multiple vectors indexing into different points in a single large mmap object, so it was going to require rather extensive changes to either the rstan or the mmap package to support this.
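For comparison, this is roughly what "multiple vectors indexing into one mapping" looks like where the language supports it. The sketch is Python, not R, and the layout table (two parameters stored back to back as native float64) is a made-up example, not a stannis or rstan format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: 4 draws of "mu" followed by 6 draws of "sigma".
layout = {"mu": (0, 4), "sigma": (4, 6)}  # name -> (offset, count) in doubles

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "samples.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.write(struct.pack("10d", *range(10)))  # native float64 values 0..9

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

base = memoryview(mm)
doubles = base.cast("d")  # zero-copy float64 view of the whole mapping
views = {name: doubles[off : off + n] for name, (off, n) in layout.items()}

mu = views["mu"].tolist()        # values are copied out only here
sigma_last = views["sigma"][-1]  # single element read through the view

# Every view has to be released before the mapping can be closed.
for v in views.values():
    v.release()
doubles.release()
base.release()
mm.close()
```

Each entry in `views` behaves like an independent vector but shares the one mapping, which is the capability the R mmap package appeared to lack.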


#23

What else does rstan do? My output is reshaped to be identical (per-parameter arrays with dimensions equivalent to rstan’s). Is it also calculating R-hats or something?

boost::iostreams looks like it makes this relatively painless, so I was going to go that route.


#24

Yeah, that’s worth doing but I’ll wait till the Ubuntu I’m on decides to update to the new R since there are supposedly speedups coming :)


#25

It’s not calculating $\hat{R}$, but it does a bunch of string munging on the parameter names that seemed to take up a fair amount of time. I don’t remember all the details, but you can see the slow steps using the R profiling tools.

I’m not familiar with boost::iostreams. It may be the same as the mmap format - which is just the raw binary representation of the floats. The cool thing about mmap is that the OS can manage the memory overhead, and modern OSs are really good at this. When the data is needed, the OS loads the page into memory, and if memory is needed by R or another process, the page gets removed from memory.

Compared to writing out to files with streams, I think it would be comparable in the initial loading of the CSV, but it could be cleaner when loading and working on the saved object, since you wouldn’t need to specify in advance which parameters to load - they would just be pulled into memory on demand. I think it can also be made a little cleaner this way because you can have one big disk object for the whole model, rather than a separate file for each parameter.

Yes, R 3.5 implemented buffered input, which makes the scan function run much faster.


#26

Huh, I recall something about that, but not why the munging might be necessary.

boost::iostreams is a library; it has mapped files as a feature: https://www.boost.org/doc/libs/1_67_0/libs/iostreams/doc/classes/mapped_file.html


#27

The munging has to do with converting vectors and matrices from the flat format in the CSV file to the proper format for the stanfit class.
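As a rough illustration of that conversion (not rstan’s actual code), the Python sketch below groups flat column names such as `theta.2.1` by base name, infers dimensions from the largest index seen, and rebuilds one draw of a matrix. The header, the row, and both the function and variable names are invented for the example.

```python
def group_columns(header):
    """Map each base name to (dims, {index tuple: column position}).

    "theta.2.1" is split into the base name "theta" and the 1-based
    index (2, 1); scalars such as "lp__" get empty dims.
    """
    entries = {}
    for col, name in enumerate(header):
        base, *idx = name.split(".")
        entries.setdefault(base, []).append((tuple(int(i) for i in idx), col))
    groups = {}
    for base, pairs in entries.items():
        ndim = len(pairs[0][0])
        dims = tuple(max(idx[k] for idx, _ in pairs) for k in range(ndim))
        groups[base] = (dims, dict(pairs))
    return groups


# Invented header and one draw; the matrix columns are listed column-major.
header = ["lp__", "mu", "theta.1.1", "theta.2.1", "theta.1.2",
          "theta.2.2", "theta.1.3", "theta.2.3"]
row = [-7.3, 0.1, 11.0, 21.0, 12.0, 22.0, 13.0, 23.0]

groups = group_columns(header)
dims, pos = groups["theta"]

# Rebuild the 2 x 3 matrix for this draw from its flat columns.
theta = [[row[pos[(i, j)]] for j in range(1, dims[1] + 1)]
         for i in range(1, dims[0] + 1)]
```

Doing the name parsing once per file, rather than once per draw, is what keeps this step cheap.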


#28

That part I do already too, so I think it’s a fair comparison.


#29

Sharing one more update here…

I made a simple binary format and some code to read it neatly. The interface could use more work and error checking but I’m excited to see it works so well, particularly in terms of managing memory requirements. Next step for me is adding a ‘binary’ output argument to CmdStan to write this format directly w/o bothering with .csv.

The description of the format is here: https://www.overleaf.com/read/ctpjdwsbrvbw
The package is the same one: https://github.com/sakrejda/stannis

The memory requirements during the .csv -> binary conversion are mostly near zero. During reshaping each named parameter is loaded in full, so you do need to be able to hold one parameter at a time as a std::vector<double>. That could be fixed with mmap but that’s not ready yet. For the moment the syntax to rewrite a .csv file to binary is (dir must exist):

library(stannis)
run = stannis::read_run(root = path_to_output_csv, uuid = NULL)   

To extract a parameter you pass the directory and parameter name to a function:

theta = stannis::get_parameter('./sample', 'theta')
theta_ = stannis::get_parameter('./sample', 'theta', mmap=TRUE)

Both return an array with correct dimensions (matching rstan), but the second form uses mmap, which doesn’t load the data into memory until you subscript it. The discrepancy vs. rstan is less than 1.1e-16.

The speed was about 2.6 times faster than rstan::read_stan_csv on R 3.5 (100 replicates, 30 MB output.csv). All timings are in seconds:

> mean(timing_stannis)
[1] 2.01799
> mean(timing_rstan)
[1] 5.20083
> sd(timing_stannis)
[1] 0.03693852
> sd(timing_rstan)
[1] 0.09013593
> mean(timing_rstan/timing_stannis)
[1] 2.577687
> sd(timing_rstan/timing_stannis)
[1] 0.04634689

Version info:

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
...

[1] "magrittr"
> library(rstan)
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)

Doing the parsing is a bunch of C++ but getting CmdStan to produce binary would be quite a bit shorter.


#30

Is there a distinct advantage to this approach over using something like
Apache Arrow or Feather https://github.com/wesm/feather (which is
Arrow-based and actively developed by Hadley Wickham and Wes McKinney)?

I worry about these custom formats after the experience with the custom
csv format. It would be nice if, say, Python users could simply write
pandas.read_feather('header.bin') and get something useful back.


#31

Having gone through this I understand the trade-offs a little better:

  1. With this custom format you can use R’s mmap package or Python’s (https://docs.python.org/2/library/mmap.html) and read the array and the dims from separate files. That’s the reason I kept it simple: the samples are just doubles and the dims are an unsigned int type (std::uint_least32_t for the moment, but I should pick something consistent machine-to-machine).

  2. Because it’s so simple, a very basic mmap package (like R’s) will work fine. A few helper functions can take care of the heterogeneous binary files.

  3. As a consequence you don’t need arrow/protobuf/etc…

  4. Google protobuf is great for cross-platform cross-language support but you can’t mmap it directly so that’s probably something that disqualifies it.

  5. feather is focused on data.frames which we really don’t need… we need arrays (for dims and doubles) + metadata

  6. the arrow C++ APIs don’t have fantastic doc, so even if we write something around it I’m concerned that at the interface level it would be hard to maintain or to help interface devs with. I see it has some tutorials but then you’re down to reading C++ headers and there’s not enough guidance about best practices.

If we had somebody who already knew best practices with Arrow’s C++ API I think it would be an easy change to make (mostly keeping the split-file format).
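To show how small the helper code for a split-file layout can be, here is a Python sketch under stated assumptions: dims are stored as fixed-width little-endian uint32 in one file and samples as little-endian float64 in another. The `.dims`/`.bin` file names and both functions are hypothetical and are not the actual stannis format, which is described in the linked document.

```python
import math
import os
import struct
import tempfile


def write_parameter(directory, name, dims, values):
    """Write dims (uint32) and samples (float64) to two separate files."""
    with open(os.path.join(directory, name + ".dims"), "wb") as f:
        f.write(struct.pack(f"<{len(dims)}I", *dims))
    with open(os.path.join(directory, name + ".bin"), "wb") as f:
        f.write(struct.pack(f"<{len(values)}d", *values))


def read_parameter(directory, name):
    """Read back (dims, flat values), checking that the sizes agree."""
    with open(os.path.join(directory, name + ".dims"), "rb") as f:
        raw = f.read()
    dims = struct.unpack(f"<{len(raw) // 4}I", raw)
    with open(os.path.join(directory, name + ".bin"), "rb") as f:
        raw = f.read()
    values = list(struct.unpack(f"<{len(raw) // 8}d", raw))
    if math.prod(dims) != len(values):
        raise ValueError(f"{name}: dims {dims} do not match {len(values)} values")
    return dims, values


tmp_dir = tempfile.mkdtemp()
write_parameter(tmp_dir, "theta", (2, 3), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
dims, values = read_parameter(tmp_dir, "theta")
```

Pinning the integer width and byte order in the format string is one way to get the machine-to-machine consistency mentioned in point 1.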


#32

Could we use netCDF/hdf5 or some other commonly used scientific file format and save the samples and other information with their correct dimensions?


#33

We could, but the question is whether it meets our requirements. For me these are:

  1. stability
  2. easy maintenance
  3. cross platform usage
  4. multi-language usage
  5. mmap-friendly (more multi-language issues)
  6. implementable

Last I looked at hdf5 the forums had long-standing unfixed critical bugs (data corruption). I think the C drivers were solid but there was some bit rot beyond that. If a project can’t or won’t fix stability bugs I personally don’t want to use it.

I looked at other formats, and the ones I think might not be a maintenance nightmare are Arrow (well known pluses, mmap-friendly, missing good high level doc, I haven’t spent the time to figure out best practices) and capnproto (straight c++, small community, used at cloudflare, really cool design, mmap-friendly, great doc including best practices but it’s in the headers).

Arrow is probably the best bet for getting buy-in and meeting our needs. They’re new enough and have clear goals so we’re less likely to get “It’s a feature not a bug” arguments if we need to submit a PR to fix something critical to us.

The main argument against using arrow is that we don’t really need it…


#34

Would this mean that there would be multiple files for each dataset, or would they all be inside one file?

Is there a way to add compression for streamed data?

OT: Also, my guess is that numpy memmap is more suitable for reading large files than Python’s mmap interface.

https://docs.scipy.org/doc/numpy-1.14.5/reference/generated/numpy.memmap.html


#35

Yes, multiple files for each .csv. It makes it super easy to use mmap (mmap offsets have to be multiples of the page size, so they’re inconveniently coarse for packing several arrays into one file).
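The offset restriction is easy to see from Python, where the required alignment is exposed as `mmap.ALLOCATIONGRANULARITY`. The sketch below is a hypothetical helper showing the usual workaround when an array does not start on an aligned boundary inside a larger file: map from the nearest aligned offset below it and skip the difference.

```python
import mmap
import os
import struct
import tempfile


def read_doubles_at(path, byte_offset, count):
    """Read `count` float64 values starting at an arbitrary byte offset."""
    gran = mmap.ALLOCATIONGRANULARITY     # required offset alignment
    aligned = (byte_offset // gran) * gran
    delta = byte_offset - aligned         # bytes to skip inside the mapping
    length = delta + count * 8
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ,
                       offset=aligned) as mm:
            chunk = mm[delta : delta + count * 8]
    return list(struct.unpack(f"<{count}d", chunk))


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "all_parameters.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.write(struct.pack("<20000d", *range(20000)))

# The 15000th double starts at byte 120000, which is generally not aligned.
tail = read_doubles_at(path, 15000 * 8, 3)
```

With one file per parameter every mapping starts at offset 0, so none of this bookkeeping is needed.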

I’m relying on boost streams and they have both compression and mmap in their streams interface.
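As a language-neutral illustration of streamed compression, here is a small Python sketch that uses the standard `gzip` module in place of boost streams. It appends draws to a compressed file one iteration at a time and reads them back; the file name and the helper are hypothetical. One trade-off to keep in mind is that a compressed file can no longer be mmapped directly as an array of doubles.

```python
import gzip
import os
import struct
import tempfile


def append_draws(stream, draws):
    """Append one iteration's draws as little-endian float64."""
    stream.write(struct.pack(f"<{len(draws)}d", *draws))


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "theta.bin.gz")  # hypothetical file name

# Write 100 iterations of 4 values each, compressing as we go.
with gzip.open(path, "wb") as out:
    for it in range(100):
        append_draws(out, [float(it), 0.0, 0.0, 0.0])

# Reading decompresses the stream back into the raw float64 bytes.
with gzip.open(path, "rb") as src:
    raw = src.read()
restored = struct.unpack(f"<{len(raw) // 8}d", raw)

raw_size = len(raw)
compressed_size = os.path.getsize(path)
```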


#36

I agree that hdf5 is problematic. There’s been a lot of discussion of its
issues. I’ve experienced some of them myself.

I think we should use a standard format even if it comes at the cost of
performance and/or space. So arrow, compressed json/cbor, compressed
protobufs, etc. My main concern is third parties who want to
post-process or analyze the data. If we use a standard format they’re
going to have a far easier time loading the data. That is, they will not
need to write their own parser (or load some library that we write).


#37

Is there a relevant programming language that can’t mmap a file of doubles?


#38

Here’s an example of the rock bottom fallback in R: https://stackoverflow.com/questions/26584227/how-can-i-read-float-data-from-a-binary-file-using-r


#39

I’m thinking more of the other features of the data, such as the field
names. And, also, asking someone who is not familiar with C or C++ to
write code to deal with a binary file such as this is asking a lot! I
think it would be easier if they could just write pandas.read_json
or pandas.read_feather (or whatever the R equivalent is).


#40

Tell you what, I was going to look into Arrow as well, so instead of us carrying on here in the usual way, I’ll let you know when I redo this in Arrow. The C++ API is not bad; it’s just disconcerting that you have to sniff out what the best practices are.