I was setting up rstan on an HPC cluster but ran into a V8 installation problem, so I switched to cmdstanr, which installed smoothly. While waiting for the installation I tested the simple model shown below with N=1,000 on my laptop: with rstan (Stan 2.26.1) it took 1 s, but with cmdstan (Stan 2.28.1) it took 3 s. With N=10,000, rstan took 35 s while cmdstan took 52 s. I wonder if this is because of the new diagnostic features in cmdstan? The difference doesn't seem small; maybe I missed something and this gap is expected. Thanks.
data {
  int N;
}
parameters {
  vector[N] a;
}
model {
  a ~ normal(0, 1);
}
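For reference, a minimal sketch of how this kind of rstan-vs-cmdstanr timing can be set up in R (the file name, seed, and chain count here are placeholders, not necessarily the exact calls I used):

# Compile both models up front so compilation time is excluded from the timings.
# "bench.stan" is a placeholder file name containing the model above.
library(rstan)
library(cmdstanr)

stan_file <- "bench.stan"
stan_data <- list(N = 1000)

rstan_mod   <- stan_model(stan_file)
cmdstan_mod <- cmdstan_model(stan_file)

# rstan (Stan 2.26.1 in the timings above); chains run sequentially by default
t_rstan <- system.time(
  fit_rstan <- sampling(rstan_mod, data = stan_data, chains = 4, seed = 123)
)

# cmdstanr (CmdStan 2.28.1 in the timings above)
t_cmdstan <- system.time(
  fit_cmdstan <- cmdstan_mod$sample(data = stan_data, chains = 4, seed = 123)
)

print(t_rstan)
print(t_cmdstan)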
@rok_cesnovar @stevebronder Are we expecting 2.28.1 (with CmdStan) to be slower than 2.26.1 (with rstan, presumably the experimental branch since the release isn't up to 2.26.1 yet)? If so, is it supposed to be this much slower?
Thanks for your reply @jonah. To be fair, I did not set a seed when launching these runs, but I guess that shouldn't influence the speed much (maybe I'm wrong, in which case the mystery is solved). When I increased N to 1M, it took 10,396 s for rstan and 19,553 s for cmdstanr (four chains, without setting the same seed).
I think a simple model like this would be dominated by printing the output to the CSV file. Increasing N just means more output, and normal_lpdf is a very simple function.
If it is, it's definitely a surprise to us. We did quite a bit of performance testing, as we do for each release, and nothing popped up. 2.27 added back range checks, which can add up to 10% of run time if not disabled, but this model is fully vectorized.
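For anyone who wants to rule the range checks out, a rough sketch of how they can be disabled for a CmdStan installation via cmdstanr; the STAN_NO_RANGE_CHECKS flag name and these helper calls should be double-checked against the docs for your version:

# Sketch (assumptions flagged above): add the flag to CmdStan's make/local,
# rebuild CmdStan, then force-recompile the model so the flag takes effect.
library(cmdstanr)
cmdstan_make_local(cpp_options = list("STAN_NO_RANGE_CHECKS" = "true"))
rebuild_cmdstan()
mod <- cmdstan_model("bench.stan", force_recompile = TRUE)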
I'm curious, have you looked at data.table's fwrite C code to see how they write gigabytes of data really fast? Are there optimizations for writing to disk? Should we be testing .arrow or .feather outputs instead of CSV?
It might be worth running the example with a profiler and measuring which part(s) seem to be slow.
I had a go at profiling this myself, using cmdstan (but not rstan), with Linux perf:
environment:
- host machine: Linux, Ubuntu 20.04, x86_64, with a 10-year-old, low-quality, slow HDD
- installed cmdstan 2.28.1 from GitHub into an ubuntu:20.04 docker container
- installed clang++-11
- installed linux perf (complication: the version of perf installed in the container must exactly match the Linux kernel of the host machine outside the container)
- compiled the same example model as in this thread and ran it with N=1,000; it ran single-threaded on one CPU core
time ./bench sample data file=bench-data.json output file=/dev/null
<snipped chatter from stan>
real 0m28.639s
user 0m27.806s
sys 0m0.816s
time ./bench sample data file=bench-data.json
<snipped chatter from stan>
real 0m28.697s
user 0m27.683s
sys 0m0.940s
make STAN_CPP_OPTIMS=true PRECOMPILED_HEADERS=false examples/bench/bench
time ./bench sample data file=bench-data.json
<snipped chatter from stan>
real 0m28.709s
user 0m27.690s
sys 0m0.960s
time ./bench sample data file=bench-data.json output file=/dev/null
<snipped chatter from stan>
real 0m28.682s
user 0m27.971s
sys 0m0.692s
perf record ./bench sample data file=bench-data.json
perf report
Overhead Command Shared Object Symbol
9.18% bench bench [.] stan::math::var_value<double, void>::var_value<double, (void*)0>
4.60% bench bench [.] stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_
4.23% bench bench [.] stan::mcmc::ps_point::operator=
3.55% bench bench [.] boost::random::detail::unit_normal_distribution<double>::operato
3.36% bench bench [.] stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_
2.45% bench bench [.] stan::math::gradient<stan::model::model_functional<stan::model::
1.70% bench bench [.] stan::math::normal_lpdf<true, Eigen::Matrix<stan::math::var_valu
1.44% bench bench [.] stan::math::internal::callback_vari<double, stan::math::operands
1.35% bench (deleted) [.] 0x0000000000060d9c
1.29% bench bench [.] stan::math::to_ref<Eigen::CwiseUnaryOp<stan::math::value_of<Eige
1.19% bench bench [.] stan::mcmc::diag_e_metric<stan::model::model_base, boost::random
1.19% bench (deleted) [.] 0x000000000005eb3c
1.08% bench bench [.] stan::model::model_base_crtp<bench_model_namespace::bench_model>
1.08% bench bench [.] Eigen::Matrix<double, -1, 1, 0, -1, 1>::Matrix<Eigen::CwiseBinar
1.08% bench (deleted) [.] 0x0000000000060d7c
1.06% bench bench [.] stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_
1.01% bench (deleted) [.] 0x0000000000060b7c
I don't see anything in the perf report that jumps out as disk I/O, but (i) I'm not sure I would recognise that from perf output, and (ii) since I was running this inside a docker container via docker run, I don't even have a good understanding of where (physically) the output CSV was getting written; perhaps docker was just storing the container's filesystem in my RAM (??).
An I/O revamp is a perennial to-do. @mitzimorris mentioned a while ago that there were glimmers of hope for someone to work on it. I personally think zarr is the best format I've seen lately.
If we stay with JSON parsers, the simdjson library is written in C++, claims to be 4x faster than rapidjson, and is Apache 2.0 licensed (I don't know if that is compatible with BSD, just that it's not GPL).
Playing with this example a bit more, I am less convinced that I/O is at fault here: we can limit the effect of I/O by thinning, and based on my tests, thin = 500 doesn't affect run times much. Will investigate a bit more. This model is a bit odd for a performance test, in my opinion, since there is no input data.
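The kind of thinning check I mean, roughly (model file name and N here are placeholders):

# If writing draws to CSV were the bottleneck, thin = 500 (which saves only
# every 500th post-warmup draw) should cut elapsed time noticeably.
library(cmdstanr)
mod  <- cmdstan_model("bench.stan")
data <- list(N = 10000)

t_full <- system.time(mod$sample(data = data, chains = 4, seed = 1))
t_thin <- system.time(mod$sample(data = data, chains = 4, seed = 1, thin = 500))

print(t_full)
print(t_thin)  # similar elapsed time => I/O is probably not the bottleneck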
Given that I triggered the discussion on I/O, I am still going to respond, even though this exact case does not seem to be related to it.
I haven't looked at their code yet, but the problem here is a bit different: CmdStan writes samples immediately at each sampling iteration rather than holding them in memory until the end and then dumping them into the output file. The latter approach is much easier to parallelize or optimize.
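To illustrate the difference, a toy R sketch (not CmdStan's actual writer; the sizes are made up):

# Toy illustration: streaming one draw per iteration vs. buffering everything
# and dumping the whole matrix once at the end.
library(data.table)

n_iter <- 1000
n_par  <- 1000
draw   <- function() as.data.table(matrix(rnorm(n_par), nrow = 1))

# Strategy 1: append one draw per iteration (analogous to how CmdStan streams output)
t_stream <- system.time({
  for (i in seq_len(n_iter)) {
    fwrite(draw(), "stream.csv", append = (i > 1), col.names = (i == 1))
  }
})

# Strategy 2: hold all draws in memory and write once (easier to optimize/parallelize)
t_buffered <- system.time({
  draws <- as.data.table(matrix(rnorm(n_iter * n_par), nrow = n_iter))
  fwrite(draws, "buffered.csv")
})

print(t_stream)
print(t_buffered)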
I think part of the reason an I/O revamp hasn't happened is that there are too many competing formats, and everyone who cares about this has their own preferred format. Libraries pop up essentially every day claiming to be the best. At least for me, that is partly why I never dug deeper: I was afraid of writing a design doc and a partial implementation only for someone to come along and say "oh, why choose format XY, that is so 2018". I would much rather base our approach on an older format, even if that meant we weren't on the bleeding edge. Great R and Python support is a must, Julia support is also a big plus, etc.
I looked at simdjson before selecting rapidjson for the JSON input parser. The issue is that it requires C++17, and Stan is still on C++11; we are struggling to get support even for moving to C++14 (Moving to C++14 · Issue #2489 · stan-dev/math · GitHub).
From my experience, input parsing is much less of an issue than output writing (with the possible exception of CSV input parsing for standalone GQ, but that is a less commonly used feature anyway). And even writing to disk is rarely a bottleneck, except maybe for toy models; otherwise, the gradient evaluation time will always be much larger than the time spent writing to disk.
I believe reading the CSVs in R or Python is the biggest bottleneck and pain point for large numbers of parameters and many real-world use cases. So a format that is easy and fast to read in R/Python/Julia should be chosen IMHO.
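As a rough way to quantify the read-side cost on an existing fit (assuming a cmdstanr fit object called fit from a model with many parameters):

# Sketch: measure how long it takes to get a large CmdStan fit back into R.
library(cmdstanr)

csv_files <- fit$output_files()   # 'fit' assumed to be a CmdStanMCMC object
t_read <- system.time(res <- read_cmdstan_csv(csv_files))
print(t_read)  # for wide outputs this can dominate the end-to-end workflow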
Not only that, each chain is writing one draw's worth of output for all algorithm and Stan program variables, while any downstream analysis wants all draws, by chain, of each variable.
And that's just the per-iteration information. There's also higher-level information: step size, metric (folks keep asking for the Hessian matrix, too), as well as lower-level, per-proposal information, not to mention gradients.
(Digression: we've hacked the Stan CSV files so that comments are used for some of this, notably step size and metric; these come after the CSV header and after the warmup iterations, if saved. This is an obstacle to fast CSV parsing.)
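To make that obstacle concrete, a sketch of the workaround one ends up with when pointing a fast CSV reader at a Stan CSV file (read_stan_csv_fast is just an illustrative helper name and the file name is a placeholder; as far as I know data.table::fread has no comment-skipping option):

# Filter out the '#' comment blocks (step size, metric, timing) by hand
# before handing the remaining lines to a fast CSV parser.
library(data.table)

read_stan_csv_fast <- function(path) {
  lines <- readLines(path)                      # pull the whole file into memory
  fread(text = lines[!startsWith(lines, "#")])  # drop comment lines, then parse
}

draws <- read_stan_csv_fast("output_1.csv")     # placeholder file name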
We need a good design that allows for clean I/O integration with the algorithms. We don't have one yet.
I've used protobuf once and liked it, but I think it's geared toward low latency and small packages of not-necessarily-numeric data (our CSV output is just a regular matrix).
I totally get this, and it's why I was a proponent of HDF5/NetCDF for the first few years (yes, the debate has been going on that long!) of the I/O discussions. But the performance and scaling on modern architectures of zarr (which, by the way, is now used behind the scenes as the default backend by xarray/ArviZ) has swayed me in the last year. And I get the sense that the ecosystem has shifted such that older, more established formats can no longer be expected to have better language support; certainly the state of HDF5/NetCDF support in R is a bit of a mess (you can work with them, but there isn't a mature, rock-solid consensus package for doing so).
Anyway, sorry for injecting this I/O tangent into a performance regression thread, couldn't resist :)
I'll be using protobuf shortly for work, so I will get some deeper experience with it to inform an opinion, but my sense from working extensively with MCMC outputs (and computations on them) over the last few years is that we'd be leaving a lot of performance on the table if we don't opt for an arrays-first format. Certainly we'd want something with good, flexible metadata and groupings, but arrays need to be the core of the design.
I went in and checked: with 50k parameters and 1,000 samples, the total time spent writing to the CSV files is around 20 seconds; with 10k parameters it's 4 seconds. I would say that 20 seconds in a model with 50k parameters is typically negligible (outside of toy examples), and the same goes for 4 seconds with 10k parameters.
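For a rough sense of the data volume behind those numbers (the bytes-per-value figure is a guess and depends on sig_figs):

# Back-of-envelope size of the CSV output for the 50k-parameter case:
# 1,000 draws x 50,000 parameters, assuming ~10 bytes per printed value
# (digits plus a comma).
n_draws         <- 1000
n_params        <- 50000
bytes_per_value <- 10                     # assumption
total_bytes     <- n_draws * n_params * bytes_per_value
total_bytes / 1e9                         # ~0.5 GB of text
(total_bytes / 1e6) / 20                  # ~25 MB/s if written in ~20 s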