Speed difference between rstan and cmdstan for a simple model

Hi there,

I was setting up rstan on a HPC but got V8 installation problem, so I turned to cmdstanr which was installed smoothly. It was boring while waiting for the installation so I tested the same simple model shown below with N=1,000 on my laptop, with rstan (stan 2.26.1) it took 1s, but with cmdstan (stan 2.28.1) it took 3s. When N=10,000, rstan took 35s while cmdstan took 52s. I wonder if this is because of the new diagnostic features in cmdstan? The difference seems not small, maybe I missed something and this difference should be the case. Thanks.

data {
 int N;
}

parameters {
 vector[N] a;
}

model {
 a ~ normal(0,1);
}
2 Likes

Yeah I agree that’s a pretty big difference.

@rok_cesnovar @stevebronder Are we expecting 2.28.1 (with CmdStan) to be slower than 2.26.1 (with rstan, presumably the experimental branch since the release isn’t up to 2.26.1 yet)? If so, is it supposed to be this much slower?

With a simple model like this I would say the difference comes from the input/output of rstan/cmdstan.

1 Like

Thanks for your reply @jonah. To be fair enough, I did not set seed when launching this, but I guess it may not influence the speed that much (maybe I am wrong, then the mystery is solved). When I increased N to 1M, it took 10396s for rstan and 19553s for cmdstanr (four chains without setting the same seed).

@rok_cesnovar so do you mean we should not use this simple model as a benchmark even if we increase the observation number N? I saw people testing on this in another post Script Draft for Intro Stan Channel on YouTube - #7 by Bob_Carpenter

I think a simple model like this would be dominated by printing the output to the CSV file. Increasing N just means more output, and normal_lpdf is a very simple function.

2 Likes

If it is, its definitely a suprise to us. We have done quite some performance testing as we do each release and nothing popped up. 2.27 added back range checks, which can add up to 10% if not disabled, but this model is fully vectorized.

2 Likes

I’m curious, have you looked at data.tables fwrite c code to see how they write like gb of data really fast? Are there optimizations to write to disc? Should we be testing .arrow or .feather outputs instead of csv?

2 Likes

Imo I think the answer is just getting rstan updated so we have a direct connection to the model instead of writing to and from disk all the time

4 Likes

It might be worth running the example in with a profiler and measuring which part(s) seem to be slow.

I had a go at trying to profile this myself, using cmdstan (but not rstan), with linux perf:

environment

  • host machine: linux, ubuntu 20.04 , x86_64 , with a 10 year old low-quality slow HDD.
  • installed cmdstan 2.28.1 from github into a ubuntu:20.04 docker container
  • installed clang++-11
  • installed linux perf (complication: version of perf installed in container must exactly match with the linux kernel of the host machine outside of the container)
  • compiled the same example model in this thread, ran it with N=1,000. it ran single threaded on one cpu core
time ./bench sample data file=bench-data.json output file=/dev/null

<snipped chatter from stan>

real	0m28.639s
user	0m27.806s
sys	0m0.816s

time ./bench sample data file=bench-data.json

<snipped chatter from stan>

real	0m28.697s
user	0m27.683s
sys	0m0.940s



make STAN_CPP_OPTIMS=true PRECOMPILED_HEADERS=false examples/bench/bench

time ./bench sample data file=bench-data.json

<snipped chatter from stan>

real	0m28.709s
user	0m27.690s
sys	0m0.960s

time ./bench sample data file=bench-data.json output file=/dev/null

<snipped chatter from stan>

real	0m28.682s
user	0m27.971s
sys	0m0.692s


perf record ./bench sample data file=bench-data.json

perf report

Overhead  Command  Shared Object      Symbol                                                               
   9.18%  bench    bench              [.] stan::math::var_value<double, void>::var_value<double, (void*)0>
   4.60%  bench    bench              [.] stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_
   4.23%  bench    bench              [.] stan::mcmc::ps_point::operator=                                 
   3.55%  bench    bench              [.] boost::random::detail::unit_normal_distribution<double>::operato
   3.36%  bench    bench              [.] stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_
   2.45%  bench    bench              [.] stan::math::gradient<stan::model::model_functional<stan::model::
   1.70%  bench    bench              [.] stan::math::normal_lpdf<true, Eigen::Matrix<stan::math::var_valu
   1.44%  bench    bench              [.] stan::math::internal::callback_vari<double, stan::math::operands
   1.35%  bench     (deleted)         [.] 0x0000000000060d9c                                              
   1.29%  bench    bench              [.] stan::math::to_ref<Eigen::CwiseUnaryOp<stan::math::value_of<Eige
   1.19%  bench    bench              [.] stan::mcmc::diag_e_metric<stan::model::model_base, boost::random
   1.19%  bench     (deleted)         [.] 0x000000000005eb3c                                              
   1.08%  bench    bench              [.] stan::model::model_base_crtp<bench_model_namespace::bench_model>
   1.08%  bench    bench              [.] Eigen::Matrix<double, -1, 1, 0, -1, 1>::Matrix<Eigen::CwiseBinar
   1.08%  bench     (deleted)         [.] 0x0000000000060d7c                                              
   1.06%  bench    bench              [.] stan::mcmc::base_nuts<stan::model::model_base, stan::mcmc::diag_
   1.01%  bench     (deleted)         [.] 0x0000000000060b7c                                              

I don’t seem to see anything jumping out in the the perf report that reads like disk I/O, but (i) not sure if i know how to recognise that from perf output, and (ii) since i was running this inside a docker container via docker run, i don’t even have a good understanding of where (physically) the output csv was getting written – perhaps docker would have been just storing the container’s filesystem in my ram (??).

I/O revamp is a perennial to-do. @mitzimorris mentioned a while ago glimmers of hope for someone to work on it. I personally think zarr is the best format I’ve seen lately.

If we stay with json parsers the simdjson library is written in c++, claims to be 4x faster than rapidjson, and is Apache 2.0 license (I don’t know if this is compatible with BSD, just that it’s not GSL).

Playing with this example a bit more, I am less convinced that the I/O is at fault here, as we could limit the effect of I/O by thinning, and based on my tests, thin = 500 doesn’t affect run times much. Will investigate a bit more. This model is a bit weird for a performance test in my opinion as there is no input data.

Given that I triggered the discussion on I/O, I am still going to respond, even though this exact case is related to it.

I haven’t looked at their code yet, but the problem here is a bit different. CmdStan writes samples immediately for each sampling iteration and does not hold them in-memory till the end and then dump into the output file. The latter case is much easier to parallelize or optimize.

I think part of the problem an I/O revamp hasn’t happened is that there are too many competing formats and everyone that cares about this has its own preferred format. Libraries pop up essentially every day claiming to be the best. At least for me, this was partially the answer why I never dug deeper, because I was afraid of making a design doc and a partial implementation and then someone comes along and says “oh, why chose XY format, that is so 2018”. I would much rather base our approach on an older format even if that meant we weren’t on the bleeding edge of things. Great R and Python support is a must, Julia support is also a big plus, etc.

I looked at simdjson before4 selecting rapidjson for the JSON input parser. The issue is it requires C++17, and Stan is still C++11, we are struggling to get support for moving to C++14 even (Moving to C++14 · Issue #2489 · stan-dev/math · GitHub).

From my experience, input parsing is much less of an issue than output writing (with the exception of maybe CSV input parsing for standalone GQ - but that is a less commonly used feature anyways). And even writing to the disk is rarely a bottleneck, except for maybe toy models, otherwise, the gradient evaluation time will always be much bigger than the time to write to disk.

I believe reading the CSVs in R or Python is the biggest bottleneck and pain point for large numbers of parameters and many real-world use cases. So a format that is easy and fast to read in R/Python/Julia should be chosen IMHO.

1 Like

How do things look if we compare the same Stan version? So downgrade cmdstan to 2.26 and then compare is what I would do to start.

1 Like

Seems to be a performance issue…

For the model in the top comment with 5 seeds and 5 runs for each seed:

It also isn’t range checks.

3 Likes

what Rok says.

not only that, each chain is writing one draw’s worth of output for all algorithm and Stan program variables. any downsteam analysis wants all draws, by chain, of each variable.

and that’s just the per-iteration information. there’s also higher-level information: step size, metric - folks keep asking for the Hessian matrix - and lower-level information (per proposal info) - not to mention gradients.

(digression: we’ve hacked the Stan CSV files so that comments are used for some of this - notably step size and metric - these come after the CSV header and after warmup iterations, if saved. this is an obstacle to fast CSV parsing).

we need a good design that allows for clean I/O integration with the algorithms. we don’t have one - yet.

1 Like

I’ve used protobuf once and liked it, but I think it’s rather geared toward low latency and small packages of not necessarily numeric data (our csv output is just a regular matrix).

2 Likes

I totally get this, and it’s why I had been a proponent of hdf5/netcdf for the first few years (yes, it’s been going that long!) of the I/O debates. But the performance/scaling through modern architecture of zarr (which btw is used behind the scenes now as the default backend by xarray/ArViz) has swayed me in the last year. And I get the sense that the ecosystem is such that older more established formats might no longer be expected to have better support in languages; certainly the state of hdf5/netcdf support in R is a bit of a mess (you can work with them, but there isn’t a mature/rock-solid consensus package to do so)

Anyway, sorry for injecting this I/O tangent into this performance regression thread, couldn’t resist 😋

1 Like

I’ll be using protobuf shortly for work, so will be getting some deeper experience with it to inform an opinion, but my sense from working with MCMC outputs (and computations thereon) extensively for the last few years leaves me feeling that we’d be leaving a lot of performance on the table if we don’t opt for an arrays-primary format. Certainly we’d want something with good/flexible metadata and groupings, but arrays need to be the core of the design.

2 Likes

I went in and checked, with 50k parameters and 1000 samples, the total time spent writing to the CSV files is around 20 seconds, with 10k parameters it’s 4 seconds. I would say that 20 seconds in a model with 50k parameters is typically negligible (outside of toy examples). Same for 4 seconds for 10k parameters.

1 Like