CmdStan dataset loading speed

@Erik_Strumbelj noticed that CmdStan takes quite some time to load large datasets. After checking with a profiler, I found that the bottleneck is the conversion from strings to doubles.

We could avoid these conversions if some binary file format were used. We could either specify our own (maybe one binary dump file per vector/matrix?) or use something like HDF5.

What are your thoughts, @Bob_Carpenter, @seantalts, @syclik?

I suspect that parsing speed can be increased a lot. There is, for example, a massive difference between using an optimized JSON reader and a more generic one. I believe one observes this on multiple platforms.

As for binary formats, there has been some discussion about defining a protocol buffer schema for draws. We couldn’t reach agreement about what it should look like. PyStan 3 uses its own protobuf schema for this.

I think @mitzimorris would be the most knowledgeable on how to tackle this. I think we also previously discussed protobuf (there’s also Cap’n Proto, from the person who made protobuf):

https://capnproto.org/cxx.html

I doubt parsing speed can be significantly increased. Profiling showed that the vast majority of the time is spent in conversions to doubles. These are done with boost::lexical_cast, which claims to be faster than sscanf, which I would otherwise default to. If we wanted, we could offload these conversions to the GPU, but it would not be simple.
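For reference, here is a minimal micro-benchmark sketch (not CmdStan code, just a standalone harness) comparing boost::lexical_cast with std::strtod and C++17’s std::from_chars on the same tokens; note that the double overload of from_chars needs a fairly recent standard library.

```cpp
// Minimal micro-benchmark sketch (not CmdStan code): compare a few
// string-to-double conversions on the same one million tokens.
#include <boost/lexical_cast.hpp>
#include <charconv>  // std::from_chars (double overload needs a recent toolchain)
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main() {
  std::vector<std::string> tokens(1000000);
  for (std::size_t i = 0; i < tokens.size(); ++i)
    tokens[i] = std::to_string(0.123456789 * i);

  auto time_it = [&](const char* name, auto&& convert) {
    double sum = 0;  // accumulate so the work isn't optimized away
    auto start = std::chrono::steady_clock::now();
    for (const auto& s : tokens) sum += convert(s);
    auto stop = std::chrono::steady_clock::now();
    std::printf("%-12s %8.1f ms (sum=%g)\n", name,
                std::chrono::duration<double, std::milli>(stop - start).count(),
                sum);
  };

  time_it("lexical_cast", [](const std::string& s) {
    return boost::lexical_cast<double>(s);
  });
  time_it("strtod", [](const std::string& s) {
    return std::strtod(s.c_str(), nullptr);
  });
  time_it("from_chars", [](const std::string& s) {
    double x = 0;
    std::from_chars(s.data(), s.data() + s.size(), x);
    return x;
  });
}
```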

What is the install burden on the user for HDF5? I tried to install NetCDF on macOS and it was a nightmare and took some time. Presumably users who need this feature can handle the install, correct?

I haven’t used it before, but there seem to be binary packages for many systems. Anyway, we can make it an optional feature, so users who need it may have to spend some time on installation, but everybody else can ignore it.

We already have a JSON option, right? Why not go with binary JSON (or MessagePack)?

cc @seantalts: did you experiment with these?

Edit: how long does it take to load the data vs. run a model that has huge data?

Btw, is lexical_cast really the fastest way to parse a double?

I just benchmarked it and found that it was slower than necessary, and that using ujson would be nearly as fast in one case. See

https://github.com/stan-dev/stan/issues/2776

And

https://github.com/stan-dev/cmdstanpy/issues/38

My immediate thought is, “Here we go again.” This has been a perennial topic of discussion with no resolution on which format, or which schema within a format, to use. I believe protocol buffers are the current front-runner technology. We need something that supports efficient streaming input and output.

I think the only way this will be cracked is if someone takes the lead on a design based around a common standard.

We wrote our own JSON reader based on callbacks for efficiency (it doesn’t build a parse tree, as most of the existing packages do), but it just uses the built-in string-to-primitive number conversions.
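To illustrate the callback (SAX-style) idea with a concrete third-party parser, here is a sketch using RapidJSON’s Reader; this is not Stan’s own callback interface, just the same no-parse-tree approach, where numbers go straight into a flat buffer.

```cpp
// Sketch of a callback (SAX-style) JSON reader using RapidJSON.
// No document tree is built; numeric tokens are pushed into a vector.
#include <rapidjson/reader.h>
#include <cstdio>
#include <vector>

struct NumberCollector
    : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, NumberCollector> {
  std::vector<double> values;

  // Only the numeric callbacks are overridden; the base class defaults
  // simply accept all other tokens and move on.
  bool Double(double x) { values.push_back(x); return true; }
  bool Int(int x) { values.push_back(x); return true; }
  bool Uint(unsigned x) { values.push_back(x); return true; }
};

int main() {
  const char json[] = "{\"N\": 3, \"y\": [0.1, 2.5, -1.3]}";
  NumberCollector handler;
  rapidjson::Reader reader;
  rapidjson::StringStream ss(json);
  if (!reader.Parse(ss, handler)) return 1;  // stops on malformed input
  std::printf("parsed %zu numbers\n", handler.values.size());
}
```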

I measured the straight-up I/O cost of converting strings. It’s about a factor of 100 more expensive than binary I/O. It can be even more at full precision (about 20 ASCII characters per number, with exponent, if everything’s fractional) or less if you’re dealing with small integers. At full precision, ASCII isn’t very compact. We could add compression at probably very little cost, or even as a win in some places, since it cuts down on raw disk I/O.
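For anyone who wants to reproduce that kind of comparison, here is a rough sketch (not the exact measurement above): it writes 10^6 doubles as full-precision text and as raw bytes, then times reading each back.

```cpp
// Rough sketch: compare reading doubles back from full-precision ASCII
// vs. from a raw binary dump of the same values.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <limits>
#include <random>
#include <vector>

template <typename F>
double time_ms(F&& f) {
  auto start = std::chrono::steady_clock::now();
  f();
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
  std::mt19937 rng(1234);
  std::uniform_real_distribution<double> unif(0, 1);
  std::vector<double> x(1000000);
  for (double& v : x) v = unif(rng);

  {  // write both representations once, untimed
    std::ofstream txt("data.txt");
    txt.precision(std::numeric_limits<double>::max_digits10);  // ~17 digits each
    for (double v : x) txt << v << '\n';
    std::ofstream bin("data.bin", std::ios::binary);
    bin.write(reinterpret_cast<const char*>(x.data()), sizeof(double) * x.size());
  }

  std::vector<double> y;
  y.reserve(x.size());
  double t_ascii = time_ms([&] {
    std::ifstream in("data.txt");
    double v;
    while (in >> v) y.push_back(v);  // stream extraction does the string-to-double work
  });

  std::vector<double> z(x.size());
  double t_binary = time_ms([&] {
    std::ifstream in("data.bin", std::ios::binary);
    in.read(reinterpret_cast<char*>(z.data()), sizeof(double) * z.size());
  });

  std::printf("read: ascii %.1f ms, binary %.1f ms\n", t_ascii, t_binary);
}
```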

Thanks for profiling. Are they using a custom or faster ASCII-to-double converter? I also asked about some other things on the issue and don’t want to just duplicate that discussion here.

I have notes on two GLMs (Bernoulli and normal) with 1 million observations and k = 50. In these cases the input data.R file was ~1 GB.

The normal GLM on the CPU ran for 270 s, with 40 s of that time falling on I/O, almost all of it on reading the input data file.

15% is a notable amount of time, I would say. If/when we manage to parallelize the model with TBB for a 4x speedup, I/O quickly becomes about 40% of execution time (230 s of compute / 4 ≈ 58 s, plus the 40 s of I/O).

The Bernoulli GLM runs for 740 s, so the 40 s of I/O is less of a problem, but it’s still 5%.

The normal GLM on the GPU (Radeon VII) runs for 55 s, so ~70% of execution time is spent on I/O.

The Bernoulli GPU GLM runs for 95 s, meaning that ~40% of its execution time is again spent on I/O.


Agreed, I think this is an issue. @ahartikainen convinced me to look into Python’s ujson library, and it came out nearly as fast as msgpack and BSON in some simple tests, so I think we could pick up a ton of low-hanging fruit just by using a fast third-party JSON parsing library, even if we don’t offer a binary format.


I looked over the suggestions here and in the linked issues. In summary, they are: ujson, BSON, MsgPack, Protocol Buffers, NetCDF4, HDF5, Avro, Thrift, FlatBuffers, and Cap’n Proto.

I think the need to specify a schema and compile it into code for saving and loading files is unnecessary extra work in Stan. That is a downside of Protocol Buffers, Thrift, FlatBuffers, and Cap’n Proto.

I don’t think simply using another JSON parser can offer nearly as much speedup as a binary format, so that is a minus for ujson.

The remaining options (BSON, MsgPack, NetCDF4, HDF5, and Avro) all seem fine at first glance. Any preferences? If there is a need, I can write a benchmark to compare their performance.


I think CBOR should be on the list. It’s more of a standard than MsgPack or BSON. ujson is a library, not a format, right?
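As an illustration of how little code the binary-JSON route needs (this sketch assumes the nlohmann/json library, which is not what Stan’s reader uses), the same document can be round-tripped through CBOR and MessagePack:

```cpp
// Illustration only: round-trip a JSON document through CBOR and MessagePack
// using nlohmann/json's built-in binary serializers.
#include <nlohmann/json.hpp>
#include <cstdint>
#include <vector>

int main() {
  nlohmann::json data = {
      {"N", 3},
      {"y", {0.1, 2.5, -1.3}}
  };

  // Encode to the two binary formats; both return raw byte vectors.
  std::vector<std::uint8_t> cbor = nlohmann::json::to_cbor(data);
  std::vector<std::uint8_t> msgpack = nlohmann::json::to_msgpack(data);

  // Decode back; the results compare equal to the original document.
  nlohmann::json from_cbor = nlohmann::json::from_cbor(cbor);
  nlohmann::json from_msgpack = nlohmann::json::from_msgpack(msgpack);

  return (from_cbor == data && from_msgpack == data) ? 0 : 1;
}
```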


True. This could be tested against the Stan JSON reader.


Adding on to JSON, there are these two parsers. One seems easy to plug in; the other seems very fast.


Right. The bottleneck operations, no matter what you do, are:

  1. ASCII-to-double conversion and vice versa,
  2. file I/O, and
  3. conversion to Python data structures.

There’s no way to make the ASCII version of (1) competitive for full-precision floating point. I imagine a lot of the speed differences among Python libraries come from (3).
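A tiny illustration of point (1): a double needs max_digits10 = 17 significant digits to round-trip exactly, so with sign, decimal point, and exponent it comes to roughly 24 characters of ASCII versus 8 raw bytes.

```cpp
// Tiny illustration of why full-precision ASCII can't be compact: printing a
// double with max_digits10 significant digits vs. its 8-byte binary size.
#include <cstdio>
#include <limits>

int main() {
  double x = -1.2345678901234567e-300;
  char buf[64];
  int n = std::snprintf(buf, sizeof(buf), "%.*g",
                        std::numeric_limits<double>::max_digits10, x);
  std::printf("%s -> %d characters vs %zu bytes binary\n", buf, n, sizeof(double));
}
```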

That’s in line with what I saw four or five years ago when I was evaluating how best to code linear regression in a Stan model, where I was running into cases in which the I/O time dominated the fit time when using optimization.

I don’t do enough I/O optimization to have a sense of how much of that 40 s can be chopped off with a better format.

Neither do I.

My hunch is just that reading a GB of ASCII data (or the equivalent in any other format) should take less than 40 s on a modest computer.

Many of those binary formats support saving a raw binary blob, but not something that is exactly a matrix or an array of variables of the same type. Is there any reason we would not want to use that for storing vectors and matrices? Since data is usually stored and loaded on the same machine (and big-endian PCs are practically non-existent), endianness should not be a problem. We would only have to specify that data must be stored column-major. Or is there anything else I am missing?
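For example, here is a minimal sketch of the “one raw blob per matrix” idea (this is not an existing Stan interface; it assumes Eigen, whose dynamic matrices are column-major by default):

```cpp
// Minimal sketch (not Stan's actual I/O layer): dump an Eigen matrix as a raw
// column-major blob with a tiny dimensions header, and read it back.
#include <Eigen/Dense>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

void write_matrix(const std::string& path, const Eigen::MatrixXd& m) {
  std::ofstream out(path, std::ios::binary);
  std::int64_t rows = m.rows(), cols = m.cols();
  out.write(reinterpret_cast<const char*>(&rows), sizeof(rows));
  out.write(reinterpret_cast<const char*>(&cols), sizeof(cols));
  // Eigen stores MatrixXd column-major by default, so the raw buffer is
  // already in the agreed-upon layout.
  out.write(reinterpret_cast<const char*>(m.data()),
            sizeof(double) * rows * cols);
}

Eigen::MatrixXd read_matrix(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  std::int64_t rows, cols;
  in.read(reinterpret_cast<char*>(&rows), sizeof(rows));
  in.read(reinterpret_cast<char*>(&cols), sizeof(cols));
  Eigen::MatrixXd m(rows, cols);
  in.read(reinterpret_cast<char*>(m.data()), sizeof(double) * rows * cols);
  if (!in) throw std::runtime_error("short read: " + path);
  return m;
}
```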

Maybe this would be useful?