Alternative .csv reader

Reading a bunch of floats is probably fastest with:

import numpy as np

with open(filepath, mode="rb") as f:
    arr1D = np.fromfile(f, dtype=np.float64)

Then there is the business of reshaping and which memory order is used, F or C.
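For illustration, a toy example of how the order choice changes the result of a reshape (values made up):

import numpy as np

flat = np.arange(6, dtype=np.float64)
print(flat.reshape(2, 3, order="C"))  # [[0. 1. 2.], [3. 4. 5.]] (row-major)
print(flat.reshape(2, 3, order="F"))  # [[0. 2. 4.], [1. 3. 5.]] (column-major)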


Cool, I forgot about numpy. The dimensions are in a second file that’s all integers (n_iterations, n_dim, d1, d2, d3, …), though as I mentioned before I used std::uint_least32_t, which is still variable-width, so I should probably stick to a single fixed-width type. Then you could use the same pattern to read the dimensions.

The order is the same order Stan uses internally, except that it’s [n_iterations, d1, d2, d3, …]
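Putting those two pieces together, a minimal sketch of the read path, assuming the dims file holds fixed-width little-endian uint32 values; the file names here are made up:

import numpy as np

# Assumed dims-file layout: [n_iterations, n_dim, d1, d2, ...],
# stored as fixed-width little-endian uint32 (file names hypothetical).
meta = np.fromfile("beta.dims", dtype="<u4")
n_iterations, n_dim = int(meta[0]), int(meta[1])
dims = tuple(int(d) for d in meta[2:2 + n_dim])

# Flat doubles, reshaped to [n_iterations, d1, d2, d3, ...]
draws = np.fromfile("beta.bin", dtype=np.float64)
arr = draws.reshape((n_iterations,) + dims)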

That’s all at the overleaf link.

For StanJulia we initially looked at Parquet within Apache Arrow but went with Feather instead, as there’s currently no way to write out Parquet in Julia.

StanFeather.jl within the Julia interface saves out a Feather file derived from a StanDataFrame or an a3d array. We want to give users both ways of saving output so that they can choose to use another interface if they have a reason to. There might be certain tools written for RStan users that aren’t present in StanJulia, and we want users to have the option of picking a file format that is widely used across all three languages so they can work in a different interface if they need to for whatever reason.

In terms of Parquet support, we can also provide that once there’s functionality that allows us to write out Parquet files. I’ve used Parquet before and I think it’s better than Feather, especially when it comes to Python support, but for now we’re going with what’s available and what fits our needs.

I only found an empty repo for StanFeather.jl and seeing how you use the Arrow C++ API would be handy. Where is this code?

It’s something I’ve been working on for the last couple of weeks. I’ll be pushing it up pretty soon. But the idea is that we will be using Feather files and data frames as the main format for the package.

Hey, for Julia and R people: together with PyMC3 and others we are making a general data format on top of netCDF4 in Python (the ArviZ library). It would be good to have readers/writers (netCDF4 = multiple nD arrays with named dimensions) in Julia and R as well.

Sakrejda, if I understand your implementation correctly, each file is just a binary list of doubles with no metadata. In that case, it’s a pretty bare-bones and standard serialization format. In fact, this is probably more “standard” than using an auxiliary library such as HDF5, Apache Arrow, or netCDF.

Pretty much all major languages can handle reading these binary serialized files, so I don’t think this should be a major concern with the format that @sakrejda is proposing.

Have you tested this with a large number (>10,000) of parameters? One issue I ran into trying to put mmap into RStan was that I exceeded some limit on the number of open file handles. I tried to write all the parameters as a single mmap object, and then open separate mmap objects with the appropriate len/offsets for each parameter. This opened a new file handle for each parameter, which then hit the limit. In principle, you should be able to do it with a single file handle and have multiple objects working off of that, but it quickly got into more engineering than I was willing to take on at the time.
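For what it’s worth, here is a sketch of the single-handle idea in numpy terms (not the R mmap package; the file name and shapes are made up): one memmap over the whole file, with per-parameter views taken as slices, so thousands of parameters don’t open thousands of file handles.

import numpy as np

total_iter, n_params = 1000, 10_000
# One mapping for the whole file; slices are views that share it,
# so no additional file handles are opened per parameter.
mm = np.memmap("samples.bin", dtype=np.float64, mode="r",
               shape=(n_params, total_iter))
views = {f"param_{i}": mm[i] for i in range(n_params)}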

Here is my initial work on the project.

Mmap provides a “len” parameter in addition to “off”, so you shouldn’t be constrained to spacing parameters at page-size increments.


Yep, that’s what I did. (There is a “header” file for formality that identifies the run, that you can stuff free-form comments into, a UUID, etc., but the data itself is a file full of doubles or ints (for dimensions).)

The way I did it, all initial processing relies on an input iterator (no going back), so it enforces a single read through a sequential file. You end up with one file of dims and one file of doubles per named parameter, so maybe hundreds of files if your model is monstrous. Then you can open/close those separately, so you’re less likely to run into issues with file handles than if you do each .csv column separately.

Combining everything into one file would require more engineering and I haven’t seen a reason to get into that yet but I would be interested in arguments for doing so.

I thought ‘len’ was for setting where you end rather than where you begin, but I’ll take a second look. Honestly, the boost::iostreams mmap functionality is pretty straightforward, so writing the minimal wrapper we need is not onerous if the mmap package is too sloppy with file handles to work.

I’ll check out your code, thanks for sharing!

That makes sense. I created a single mmap file and one mmap object per column so that I could easily maintain compatibility with existing rstan functions.

Looking back at my code, I think you are correct. I ended up hacking around the offset with some pointer arithmetic and the new_xptr function at lines 165-175 of this commit.

Primarily, I think it will be easier to share the Stan files if it is just one file of metadata and one file per chain of samples. That ensures that random parameter files don’t get lost when transferring models between computers, and that you don’t accidentally copy some of the files into a directory that already holds parameters and end up loading a model with some parameters from one run and some from an old run.

I found the engineering for creating and reading from the file to be pretty straightforward: when you write a parameter to the file you just write it to mmap_file[iter + param_idx*total_iter], and to read you just read from mmap_file[param_offset*total_iter:(param_offset*total_iter + total_iter)].
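In numpy terms that indexing looks roughly like this (illustrative names only, not the actual R code):

import numpy as np

total_iter = 2000
mm = np.memmap("samples.bin", dtype=np.float64, mode="r+")  # existing file

def write_draw(param_idx, it, value):
    # mmap_file[iter + param_idx*total_iter]
    mm[it + param_idx * total_iter] = value

def read_param(param_offset):
    # mmap_file[param_offset*total_iter : param_offset*total_iter + total_iter]
    start = param_offset * total_iter
    return mm[start:start + total_iter]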

It was kind of a pain to handle that pointer arithmetic in the R mmap package and create the correct mmap objects. You can look at how I got it all working on that branch though.


A couple of other comments on the earlier post as well.

This is because the CSV rounds when it writes out, whereas R stores the full precision. It would be nice if CmdStan or RStan could write the binary format directly rather than translating through the CSV step.
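As a general illustration of the write-out precision point (the digit counts here are just examples, not what any particular interface uses): too few significant digits lose information, while 17 always round-trips an IEEE double.

import math

x = math.pi
assert float(f"{x:.6g}") != x   # 6 digits: rounds, doesn't round-trip
assert float(f"{x:.17g}") == x  # 17 digits: round-trips exactly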

Could you try the performance tests with the latest rstan develop branch? There are some new speed improvements there. I’m sure yours would still be faster, but the difference may not be quite as great.

aaronjg (September 6):

A couple of other comments on the earlier post as well.

sakrejda:
The discrepancy is less than 1.1e-16 vs. rstan

This is because the CSV rounds when it writes out, whereas R stores the full precision.

Actually… this is vs. RStan’s read_stan_csv; it turns out there are a bunch of ways to do it in C++.

aaronjg:
It would be nice if CmdStan or RStan could write the binary format directly rather than translating through the CSV step.

Yeah, that’s really easy and obviously faster all around; that’s the next step.

sakrejda:
The speed was a few times faster than rstan::read_stan_csv on R 3.5 (100 replicates, 30 Mb output.csv). All timings are in seconds:

aaronjg:
Could you try the performance tests with the latest rstan develop branch?

Sure, I’ll give it a shot.

+1. I’m not worried about using non-standard data formats. These should all be easily mappable to other formats if people care about them.

It’s so odd to me that over my programming career, not-invented-elsewhere has overtaken not-invented-here as a design goal.

I would strongly prefer not to generate thousands of files for models with thousands of parameters. If something’s memory mapped, it’ll be random access, not one pass only. Single passes are always more efficient, but opening/closing file handles is a relatively expensive operation in all the OSes I’ve ever used.

I would consider this a requirement. I don’t think we need 16 digits of accuracy, but there’s no reason not to be able to round-trip for sanity checks, etc. It’s just all easier to document and use that way. It’s been a hassle that our output now truncates.

Are you saying there are models with thousands of NAMED parameters?

+1. I think it’s b/c we’re all expected to learn so many tools, but there’s not so much draw to learning new languages (I mean, it’s fun, but usually if you know R/Python the main professional benefit is more libraries, not learning Julia or the ins and outs of C++ streaming).

BTW, I’ll try to address the suggestions @Bob_Carpenter and others made this weekend by checking out what things would look like if I:

  • used CmdStan to write binaries directly
  • tried to push all this into a single file while maintaining the possibility of easy mmap-based reads
  • tried single-pass writes (this would mean pre-chunking output, but that seems to be the standard for columnar analytics anyway; it’s what Apache Arrow does and what Cap’n Proto suggests, etc.); see the sketch after this list
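A very rough sketch of what single-pass, pre-chunked columnar output could look like; the chunk size, file layout, and index format here are all made up:

import numpy as np

CHUNK = 256  # draws buffered per parameter before flushing (arbitrary)

def write_chunked(path, draw_stream, n_params):
    """Single pass over the draws: append contiguous per-parameter
    chunks to one file and record (param_idx, byte_offset, n_values)
    so the chunks can be mmap-read later."""
    index, filled = [], 0
    buf = np.empty((n_params, CHUNK), dtype=np.float64)
    with open(path, "wb") as f:
        for draw in draw_stream:      # draw: length-n_params vector
            buf[:, filled] = draw
            filled += 1
            if filled == CHUNK:       # flush a full block of chunks
                for p in range(n_params):
                    index.append((p, f.tell(), CHUNK))
                    buf[p].tofile(f)
                filled = 0
        if filled:                    # flush the partial tail
            for p in range(n_params):
                index.append((p, f.tell(), filled))
                buf[p, :filled].tofile(f)
    return index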

I’m considering trying Apache Arrow, but I’m still not convinced there’s a real benefit… IDK.