CmdStan dataset loading speed

@Erik_Strumbelj noticed that CmdStan takes quite some time to load large datasets. After checking with a profiler, I found that the bottleneck is the conversion from strings to doubles.

We could avoid these conversions if some binary file format were used. We could either specify our own (maybe one binary dump file per vector/matrix?) or use something like HDF5.

What are your thoughts, @Bob_Carpenter, @seantalts, @syclik?

I suspect that parsing speed can be increased a lot. There is, for example, a massive difference between using an optimized JSON reader and a more generic one. I believe one observes this on multiple platforms.

As for binary formats, there has been some discussion about defining a protocol buffer schema for draws. We couldn’t reach agreement about what it should look like. PyStan 3 uses its own protobuf schema for this.

I think @mitzimorris would be the most knowledgeable on how to tackle this. I think we also previously discussed protobuf (there’s also Cap’n Proto, from the person who made protobuf):

https://capnproto.org/cxx.html

I doubt parsing speed can be significantly increased. Profiling showed that the vast majority of the time is spent in conversions to doubles. These are done with boost::lexical_cast, which claims to be faster than sscanf, which I would otherwise default to. If we wanted, we could offload these conversions to the GPU, but it would not be simple.
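For reference, here is a minimal micro-benchmark sketch (not CmdStan code, just a standalone harness) comparing boost::lexical_cast with std::strtod and C++17’s std::from_chars on the same tokens; note that the double overload of from_chars needs a fairly recent standard library.

```cpp
// Minimal micro-benchmark sketch (not CmdStan code): compare a few
// string-to-double conversions on the same one million tokens.
#include <boost/lexical_cast.hpp>
#include <charconv>  // std::from_chars (double overload needs a recent toolchain)
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main() {
  std::vector<std::string> tokens(1000000);
  for (std::size_t i = 0; i < tokens.size(); ++i)
    tokens[i] = std::to_string(0.123456789 * i);

  auto time_it = [&](const char* name, auto&& convert) {
    double sum = 0;  // accumulate so the work isn't optimized away
    auto start = std::chrono::steady_clock::now();
    for (const auto& s : tokens) sum += convert(s);
    auto stop = std::chrono::steady_clock::now();
    std::printf("%-12s %8.1f ms (sum=%g)\n", name,
                std::chrono::duration<double, std::milli>(stop - start).count(),
                sum);
  };

  time_it("lexical_cast", [](const std::string& s) {
    return boost::lexical_cast<double>(s);
  });
  time_it("strtod", [](const std::string& s) {
    return std::strtod(s.c_str(), nullptr);
  });
  time_it("from_chars", [](const std::string& s) {
    double x = 0;
    std::from_chars(s.data(), s.data() + s.size(), x);
    return x;
  });
}
```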

What is the install burden on the user for HDF5? I tried to install NetCDF on macOS and it was a nightmare and took some time. Presumably users who need this feature can handle the install, correct?

I haven’t used it before, but there seem to be binary packages for many systems. Anyway, we can make it an optional feature, so users who need it may have to spend some time on installation, but everybody else can ignore it.

We already have a JSON option, right? Why not go with binary JSON (or MessagePack)?

cc @seantalts: did you experiment with these?

Edit: how long does it take to load the data vs. run a model that has huge data?

Btw, is lexical_cast really the fastest way to parse a double?

I just benchmarked it and found that it was slower than necessary, and that using ujson would be nearly as fast in one case. See

https://github.com/stan-dev/stan/issues/2776

And

https://github.com/stan-dev/cmdstanpy/issues/38

My immediate thought is, “Here we go again.” This has been a perennial topic of discussion with no resolution on which format, or which schema within a format, to use. I believe protocol buffers are the current front-runner technology. We need something that supports efficient streaming input and output.

I think the only way this will be cracked is if someone takes the lead on a design based around a common standard.

We wrote our own JSON reader based on callbacks for efficiency (it doesn’t build a parse tree, as most of the existing packages do), but it just uses the built-in string-to-primitive number conversions.
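To illustrate the callback (SAX-style) idea with a concrete third-party parser, here is a sketch using RapidJSON’s Reader; this is not Stan’s own callback interface, just the same no-parse-tree approach, where numbers go straight into a flat buffer.

```cpp
// Sketch of a callback (SAX-style) JSON reader using RapidJSON.
// No document tree is built; numeric tokens are pushed into a vector.
#include <rapidjson/reader.h>
#include <cstdio>
#include <vector>

struct NumberCollector
    : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, NumberCollector> {
  std::vector<double> values;

  // Only the numeric callbacks are overridden; the base class defaults
  // simply accept all other tokens and move on.
  bool Double(double x) { values.push_back(x); return true; }
  bool Int(int x) { values.push_back(x); return true; }
  bool Uint(unsigned x) { values.push_back(x); return true; }
};

int main() {
  const char json[] = "{\"N\": 3, \"y\": [0.1, 2.5, -1.3]}";
  NumberCollector handler;
  rapidjson::Reader reader;
  rapidjson::StringStream ss(json);
  if (!reader.Parse(ss, handler)) return 1;  // stops on malformed input
  std::printf("parsed %zu numbers\n", handler.values.size());
}
```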

I measured the straight-up I/O cost of converting strings. It’s about a factor of 100 more expensive than binary I/O. It can be even more at full precision (about 20 ASCII characters per number, with exponent, if everything’s fractional) or less if you’re dealing with small integers. At full precision, ASCII isn’t very compact. We could add compression at probably very little cost, or even as a win in some places, since it cuts down on raw disk I/O.
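For anyone who wants to reproduce that kind of comparison, here is a rough sketch (not the exact measurement above): it writes 10^6 doubles as full-precision text and as raw bytes, then times reading each back.

```cpp
// Rough sketch: compare reading doubles back from full-precision ASCII
// vs. from a raw binary dump of the same values.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <limits>
#include <random>
#include <vector>

template <typename F>
double time_ms(F&& f) {
  auto start = std::chrono::steady_clock::now();
  f();
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
  std::mt19937 rng(1234);
  std::uniform_real_distribution<double> unif(0, 1);
  std::vector<double> x(1000000);
  for (double& v : x) v = unif(rng);

  {  // write both representations once, untimed
    std::ofstream txt("data.txt");
    txt.precision(std::numeric_limits<double>::max_digits10);  // ~17 digits each
    for (double v : x) txt << v << '\n';
    std::ofstream bin("data.bin", std::ios::binary);
    bin.write(reinterpret_cast<const char*>(x.data()), sizeof(double) * x.size());
  }

  std::vector<double> y;
  y.reserve(x.size());
  double t_ascii = time_ms([&] {
    std::ifstream in("data.txt");
    double v;
    while (in >> v) y.push_back(v);  // stream extraction does the string-to-double work
  });

  std::vector<double> z(x.size());
  double t_binary = time_ms([&] {
    std::ifstream in("data.bin", std::ios::binary);
    in.read(reinterpret_cast<char*>(z.data()), sizeof(double) * z.size());
  });

  std::printf("read: ascii %.1f ms, binary %.1f ms\n", t_ascii, t_binary);
}
```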

Thanks for profiling. Are they using a custom or faster ASCII-to-double converter? I also asked about some other things on the issue and don’t want to just duplicate that discussion here.

I have notes on two GLMs (Bernoulli and normal) with 1 million observations and k = 50. In these cases the input data.R file was ~1 GB.

The normal GLM on the CPU ran for 270 s, with 40 s of that time falling on I/O, almost all of it on reading the input data file.

15% is a notable amount of time, I would say. If/when we manage to parallelize the model with TBB for a 4x speedup, I/O quickly becomes about 40% of execution time (230 s of compute / 4 ≈ 58 s, plus the 40 s of I/O).

The Bernoulli GLM runs for 740 s, so the 40 s of I/O is less of a problem, but it’s still 5%.

The normal GLM on the GPU (Radeon VII) runs for 55 s, so ~70% of execution time is spent on I/O.

The Bernoulli GPU GLM runs for 95 s, meaning that ~40% of its execution time is again spent on I/O.


Agreed, I think this is an issue. @ahartikainen convinced me to look into Python’s ujson library, and it came out nearly as fast as msgpack and BSON in some simple tests, so I think we could pick up a ton of low-hanging fruit just by using a fast third-party JSON parsing library, even if we don’t offer a binary format.


I looked over the suggestions here and in the linked issues. In summary, they are: ujson, BSON, MsgPack, Protocol Buffers, NetCDF4, HDF5, Avro, Thrift, FlatBuffers, and Cap’n Proto.

I think the need to specify a schema and compile it into code for saving and loading files is unnecessary extra work in Stan. That is a downside of Protocol Buffers, Thrift, FlatBuffers, and Cap’n Proto.

I don’t think simply using another JSON parser can offer nearly as much speedup as a binary format, so that is a minus for ujson.

The remaining options (BSON, MsgPack, NetCDF4, HDF5, and Avro) all seem fine at first glance. Any preferences? If there is a need, I can write a benchmark to compare their performance.


I think CBOR should be on the list. It’s more of a standard than MsgPack or BSON. ujson is a library, not a format, right?
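As an illustration of how little code the binary-JSON route needs (this sketch assumes the nlohmann/json library, which is not what Stan’s reader uses), the same document can be round-tripped through CBOR and MessagePack:

```cpp
// Illustration only: round-trip a JSON document through CBOR and MessagePack
// using nlohmann/json's built-in binary serializers.
#include <nlohmann/json.hpp>
#include <cstdint>
#include <vector>

int main() {
  nlohmann::json data = {
      {"N", 3},
      {"y", {0.1, 2.5, -1.3}}
  };

  // Encode to the two binary formats; both return raw byte vectors.
  std::vector<std::uint8_t> cbor = nlohmann::json::to_cbor(data);
  std::vector<std::uint8_t> msgpack = nlohmann::json::to_msgpack(data);

  // Decode back; the results compare equal to the original document.
  nlohmann::json from_cbor = nlohmann::json::from_cbor(cbor);
  nlohmann::json from_msgpack = nlohmann::json::from_msgpack(msgpack);

  return (from_cbor == data && from_msgpack == data) ? 0 : 1;
}
```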


True. This could be tested against the Stan JSON reader.


Adding on to JSON, there are these two parsers. One seems easy to plug in; the other seems very fast.


Right. The bottleneck operations, no matter what you do, are:

  1. ASCII-to-double conversion and vice versa,
  2. file I/O, and
  3. conversion to Python data structures.

There’s no way to make the ASCII version of (1) competitive for full-precision floating point. I imagine a lot of the speed differences among Python libraries come from (3).
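A tiny illustration of point (1): a double needs max_digits10 = 17 significant digits to round-trip exactly, so with sign, decimal point, and exponent it comes to roughly 24 characters of ASCII versus 8 raw bytes.

```cpp
// Tiny illustration of why full-precision ASCII can't be compact: printing a
// double with max_digits10 significant digits vs. its 8-byte binary size.
#include <cstdio>
#include <limits>

int main() {
  double x = -1.2345678901234567e-300;
  char buf[64];
  int n = std::snprintf(buf, sizeof(buf), "%.*g",
                        std::numeric_limits<double>::max_digits10, x);
  std::printf("%s -> %d characters vs %zu bytes binary\n", buf, n, sizeof(double));
}
```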

That’s in line with what I saw four or five years ago when I was evaluating how best to code linear regression in a Stan model, where I was running into cases in which the I/O time dominated the fit time when using optimization.

I don’t do enough I/O optimization to have a sense of how much of that 40 s can be chopped off with a better format.

Neither do I.

My hunch is just that reading a GB of ASCII data (or the equivalent in any other format) should take less than 40 s on a modest computer.

Many of those binary formats support saving a raw binary blob, but not something that is exactly a matrix or an array of variables of the same type. Is there any reason we would not want to use that for storing vectors and matrices? Since data is usually stored and loaded on the same machine (and big-endian PCs are practically non-existent), endianness should not be a problem. We would only have to specify that data must be stored column-major. Or is there anything else I am missing?
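For example, here is a minimal sketch of the “one raw blob per matrix” idea (this is not an existing Stan interface; it assumes Eigen, whose dynamic matrices are column-major by default):

```cpp
// Minimal sketch (not Stan's actual I/O layer): dump an Eigen matrix as a raw
// column-major blob with a tiny dimensions header, and read it back.
#include <Eigen/Dense>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

void write_matrix(const std::string& path, const Eigen::MatrixXd& m) {
  std::ofstream out(path, std::ios::binary);
  std::int64_t rows = m.rows(), cols = m.cols();
  out.write(reinterpret_cast<const char*>(&rows), sizeof(rows));
  out.write(reinterpret_cast<const char*>(&cols), sizeof(cols));
  // Eigen stores MatrixXd column-major by default, so the raw buffer is
  // already in the agreed-upon layout.
  out.write(reinterpret_cast<const char*>(m.data()),
            sizeof(double) * rows * cols);
}

Eigen::MatrixXd read_matrix(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  std::int64_t rows, cols;
  in.read(reinterpret_cast<char*>(&rows), sizeof(rows));
  in.read(reinterpret_cast<char*>(&cols), sizeof(cols));
  Eigen::MatrixXd m(rows, cols);
  in.read(reinterpret_cast<char*>(m.data()), sizeof(double) * rows * cols);
  if (!in) throw std::runtime_error("short read: " + path);
  return m;
}
```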

Maybe this would be useful?